# **Data Wrangling (Part 1) Los Angeles County Urban Trees**

**Summary**

The purpose of this notebook is to clean raw urban trees data. The notebook shows the steps taken to prepare the raw dataset for exploratory data analysis and statistical analysis. A brief summary of the content of this notebook is below:

**Importing Data**

LA county data on urban trees were imported from https://github.com/stiles/data/tree/master/los-angeles-street-trees

**Cleaning Data**

Urban trees Data (Starting with 1.7 Million Trees) 

i. Removing counties dataset which has inaccurate geometry information. (49K Trees) 

ii. Removing columns with 100% missing values. 

iii. Exporting to CSV to reduce Google Colab RAM usage

In [1]:
#This notebook was run in Google Colab.

#Mounting Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
#Importing relevant packages
import pandas as pd
import requests
from bs4 import BeautifulSoup

!pip install geopandas
import geopandas as gpd

Collecting geopandas
  Downloading https://files.pythonhosted.org/packages/f7/a4/e66aafbefcbb717813bf3a355c8c4fc3ed04ea1dd7feb2920f2f4f868921/geopandas-0.8.1-py2.py3-none-any.whl (962kB)
Collecting pyproj>=2.2.0
  Downloading https://files.pythonhosted.org/packages/e4/ab/280e80a67cfc109d15428c0ec56391fc03a65857b7727cf4e6e6f99a4204/pyproj-3.0.0.post1-cp36-cp36m-manylinux2010_x86_64.whl (6.4MB)
Collecting fiona
  Downloading https://files.pythonhosted.org/packages/36/8b/e8b2c11bed5373c8e98edb85ce891b09aa1f4210fd451d0fb3696b7695a2/Fiona-1.8.17-cp36-cp36m-manylinux1_x86_64.whl (14.8MB)
Collecting munch
  Downloading https://files.pythonhosted.org/packages/cc/ab/85d8da5c9a45e072301beb37ad7f833cd344e04c817d97e0cc75681d248f/munch-2.5.0-py2.py3-none-any.whl
Collecting cligj>=0.5
  Downloadin

In [6]:
#Urban trees data are stored in about 50 separate GeoJSON files, mostly one per city in Los Angeles County.
#Big thank you to Matt Stiles for the urban trees data.

#Here we scrape the names of all the files that contain the urban tree data using BeautifulSoup.
page = requests.get('https://github.com/stiles/data/tree/master/los-angeles-street-trees/all')
soup = BeautifulSoup(page.content, 'html.parser')
container = soup.find(id="js-repo-pjax-container")
title_tag = container.select("[title]")
titles = [pt.get_text() for pt in title_tag]
titles = titles[8:]

#Cleaning the list of scraped names
titles.remove("script.sh")
titles.remove("updates to trump tweets")
titles.remove("\n.\u200a.\n")

#pasadena is removed from this list because its column names conflict with those of the other files. We will deal with this later.
titles.remove("pasadena.geojson")

#Viewing number of total files (should be 49)
print(len(titles))

49
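The scraping step above depends on GitHub's page markup, which can change over time. As a hedged alternative, the sketch below filters a directory listing shaped like the JSON returned by GitHub's REST "contents" endpoint (a list of objects with `name` and `type` keys). The helper name `select_geojson_names`, its `exclude` parameter, and the sample listing are hypothetical and only for illustration; the network call is left commented out and is an assumption about how the listing could be fetched.

```python
def select_geojson_names(listing, exclude=()):
    """Return sorted .geojson file names from a directory listing, skipping excluded names."""
    excluded = set(exclude)
    names = [
        entry["name"]
        for entry in listing
        if entry.get("type") == "file" and entry["name"].endswith(".geojson")
    ]
    return sorted(name for name in names if name not in excluded)


# Hypothetical sample listing (not the real repository contents)
sample_listing = [
    {"name": "script.sh", "type": "file"},
    {"name": "pasadena.geojson", "type": "file"},
    {"name": "alhambra.geojson", "type": "file"},
    {"name": "agoura-hills.geojson", "type": "file"},
    {"name": "archive", "type": "dir"},
]

# pasadena is excluded here for the same reason as in the cell above
titles_sketch = select_geojson_names(sample_listing, exclude=["pasadena.geojson"])
print(titles_sketch)

# With network access, the listing could be fetched like this (assumption, not run here):
# import requests
# api_url = "https://api.github.com/repos/stiles/data/contents/los-angeles-street-trees/all"
# listing = requests.get(api_url).json()
```

Filtering on the `.geojson` extension removes the need to delete stray entries such as `script.sh` by hand.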


In [7]:
#Collecting each city's dataframe in a list (DataFrame.append was removed in pandas 2.0, so pd.concat is used below)
frames = []

#Looping through the list of file names and reading each geojson file into a geopandas dataframe
for title in titles:
  url = f"https://github.com/stiles/data/blob/master/los-angeles-street-trees/all/{title}?raw=True"
  #Opening a geojson file as a geopandas dataframe
  onecitydf = gpd.read_file(url)
  #Adding a column to record the source file (city) of the dataset
  onecitydf["whichcity"] = title
  #Cleaning the column names (regex=False so that '(' and ')' are treated as literal characters)
  onecitydf.columns = onecitydf.columns.str.strip().str.lower().str.replace(' ', '_', regex=False).str.replace('(', '', regex=False).str.replace(')', '', regex=False)
  #Data from 'lat','lon','objectid' were saved in different formats in different datasets. They are removed here so that the datasets can be merged without issues.
  onecitydf = onecitydf.drop(columns=['lat','lon','objectid'], errors='ignore')
  frames.append(onecitydf)
  print(f"Done with {title}")

#Merging all the city dataframes into one dataframe
treesdf = pd.concat(frames)

Done with agoura-hills.geojson
Done with alhambra.geojson
Done with arcadia.geojson
Done with artesia.geojson
Done with bell-gardens.geojson
Done with bellflower.geojson
Done with beverly-hills.geojson
Done with burbank.geojson
Done with carson.geojson
Done with cerritos.geojson
Done with covina.geojson
Done with culver-city.geojson
Done with diamond-bar.geojson
Done with downey.geojson
Done with duarte.geojson
Done with el-monte.geojson
Done with el-segundo.geojson
Done with glendale.geojson
Done with glendora.geojson
Done with inglewood.geojson
Done with la-mirada.geojson
Done with la-verne.geojson
Done with lancaster.geojson
Done with lawndale.geojson
Done with lomita.geojson
Done with long-beach.geojson
Done with los-angeles-city.geojson
Done with los-angeles-county.geojson
Done with malibu.geojson
Done with mandarin-orange-trees.geojson
Done with paramount.geojson
Done with pomona.geojson
Done with rancho-palos-verdes.geojson
Done with redondo-beach.geojson
Done with san-dimas.geo
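To make the column-name cleaning in the loop above easier to follow, here is a minimal pure-Python sketch of the same normalization applied to a few names. The function name `clean_column_name` and the sample column names are hypothetical; they are not taken from the actual datasets.

```python
def clean_column_name(name):
    """Strip whitespace, lowercase, replace spaces with underscores, and drop parentheses."""
    return (
        name.strip()
        .lower()
        .replace(" ", "_")
        .replace("(", "")
        .replace(")", "")
    )


# Hypothetical raw column names, as they might differ between city files
raw_columns = [" Tree ID ", "Height (ft)", "Species", "DBH (in)"]
cleaned_columns = [clean_column_name(c) for c in raw_columns]
print(cleaned_columns)
```

A function like this could also be passed to pandas as `df.rename(columns=clean_column_name)`, which keeps the cleaning rule in one place when it is needed for several dataframes.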

In [8]:
#Here we fix the issues with pasadena so it can be merged with the treesdf
url = 'https://github.com/stiles/data/blob/master/los-angeles-street-trees/all/pasadena.geojson?raw=True'
pasadena = gpd.read_file(url)

#Cleaning the columns
pasadena["botanical"] = pasadena["Genus"] + " " + pasadena["Species"]
pasadena = pasadena.drop(['Genus','Species'],axis=1)
#Adding a column to record the source file (city) of the dataset
pasadena["whichcity"] = "pasadena.geojson"
#Cleaning the column names (regex=False so that '(' and ')' are treated as literal characters)
pasadena.columns = pasadena.columns.str.strip().str.lower().str.replace(' ', '_', regex=False).str.replace('(', '', regex=False).str.replace(')', '', regex=False)
#Data from 'lat','lon','objectid' were saved in different formats in different datasets. They are removed here so that the datasets can be merged without issues.
pasadena = pasadena.drop(columns=['lat','lon','objectid'], errors='ignore')

#Appending pasadena to treesdf (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
treesdf = pd.concat([treesdf, pasadena])

In [9]:
#Cleaning the main df
#The 'city' column in the original datasets contains inconsistent information. It is removed and replaced with the city name derived from 'whichcity'.
treesdf = treesdf.drop('city',axis=1)
#regex=False so that the '.' in ".geojson" is matched literally
treesdf['city'] = treesdf['whichcity'].str.replace(".geojson","",regex=False)
treesdf = treesdf.drop('whichcity',axis=1)
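The city name is simply the file name without its `.geojson` extension. The pure-Python sketch below illustrates that step and shows why the dot should be matched literally rather than as a regular-expression wildcard. The helper `city_from_filename` and the file name in `tricky` are hypothetical and used only to demonstrate the difference.

```python
import re


def city_from_filename(filename, suffix=".geojson"):
    """Return the file name without the given suffix, if the suffix is present."""
    return filename[: -len(suffix)] if filename.endswith(suffix) else filename


print(city_from_filename("la-mirada.geojson"))

# Hypothetical file name chosen to expose the regex pitfall:
# as a regex, "." matches any character, so "xgeojson" is removed as well.
tricky = "xgeojson-test.geojson"
regex_result = re.sub(".geojson", "", tricky)
literal_result = tricky.replace(".geojson", "")
print(regex_result, literal_result)
```

None of the file names in this notebook trigger the pitfall, but treating the pattern as a literal string makes the intent explicit.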

In [10]:
#Overview of columns and size of dataframe
#show_counts replaces the null_counts argument, which was removed in pandas 2.0
treesdf.info(verbose=True, show_counts=True)

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 1673610 entries, 0 to 71131
Data columns (total 225 columns):
 #   Column               Non-Null Count    Dtype   
---  ------               --------------    -----   
 0   inventoryid          139884 non-null   float64 
 1   district             267675 non-null   object  
 2   address              668706 non-null   float64 
 3   fictitious           192187 non-null   object  
 4   street               527881 non-null   object  
 5   sidetype             258325 non-null   object  
 6   tree                 430634 non-null   float64 
 7   onaddress            278577 non-null   float64 
 8   onstreet             287995 non-null   object  
 9   species              1519523 non-null  object  
 10  botanical            469450 non-null   object  
 11  dbh                  575187 non-null   object  
 12  height               736006 non-null   object  
 13  parkwaytype          105215 non-null   object  
 14  geometry             167361

In [11]:
#Dropping columns that are empty
treesdf = treesdf.dropna(how='all', axis=1)
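As a small illustration of what `dropna(how='all', axis=1)` does, the sketch below builds a toy dataframe in which one column is entirely missing and lists which columns would be dropped. The toy data and variable names are hypothetical and unrelated to the real tree records.

```python
import pandas as pd

# Toy data: 'height' has no values at all, the other columns are only partly missing
toy = pd.DataFrame({
    "species": ["oak", None, "elm"],
    "height": [None, None, None],
    "dbh": [12.0, None, 8.5],
})

# Columns where every value is missing
empty_cols = [col for col in toy.columns if toy[col].isna().all()]
print(empty_cols)

# Only fully empty columns are removed; partly filled columns are kept
toy_cleaned = toy.dropna(how="all", axis=1)
print(list(toy_cleaned.columns))
```

Checking the list of empty columns first is a cheap way to confirm that nothing unexpected is removed.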

In [None]:
#Exporting the trees data to a CSV file, which we will import during analysis. This keeps Google Colab from running out of RAM and crashing.
treesdf.to_csv('LAtrees.csv')
!cp LAtrees.csv "/content/drive/My Drive/Public Trees - Wes/"
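When the CSV is loaded again for analysis, memory use can be reduced further by reading only the columns that are needed. The sketch below uses a small in-memory CSV as a stand-in for `LAtrees.csv`; the sample rows and the choice of columns are hypothetical.

```python
import io

import pandas as pd

# In-memory stand-in for the exported file (hypothetical sample rows)
csv_text = io.StringIO(
    "species,height,city\n"
    "oak,30,alhambra\n"
    "elm,25,burbank\n"
)

# Read only the columns required for the analysis at hand
subset = pd.read_csv(csv_text, usecols=["species", "city"])
print(subset)
```

With the real file, the first argument would be the path to `LAtrees.csv` instead of the in-memory buffer.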