# **Examining childcare data**


I start by importing a new module called geopandas to my notebook.

In [1]:
import geopandas as gpd

Next, I need to import data. Here, I have downloaded data on childcare and family child care from the Department of Social Services: https://www.ccld.dss.ca.gov/carefacilitysearch/DownloadData
The data is in CSV format.

But first, I want to know which folder my notebook is working in.

In [2]:
pwd

'/home/jovyan/up206a/Weeks/Week02/Childcare data'

Now I use a relative path to point to where the childcare data is located, using ../ to step up one folder.

In [3]:
childcare = gpd.read_file('../Childcare data/childcarecenters.csv')
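
To see where that relative path ends up, here is a small sketch using only the Python standard library. It assumes the working directory printed by pwd above and only manipulates the path text; nothing is read from disk.

```python
import posixpath
from pathlib import PurePosixPath

# Working directory reported by pwd in the notebook
cwd = PurePosixPath('/home/jovyan/up206a/Weeks/Week02/Childcare data')

# Relative path passed to read_file; '..' means "one folder up"
relative = '../Childcare data/childcarecenters.csv'

# Join the two and collapse the '..' step to get the absolute path
resolved = posixpath.normpath(cwd / relative)
print(resolved)
```

In this case the path steps up to Week02 and straight back down into the same folder, so the shorter 'childcarecenters.csv' would presumably point to the same file.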

Next, I want to know the data type of the childcare data.

In [4]:
type(childcare)

geopandas.geodataframe.GeoDataFrame

But what does my data actually look like?

In [8]:
childcare.head()

Unnamed: 0,Facility Type,Facility Number,Facility Name,Licensee,Facility Administrator,Facility Telephone Number,Facility Address,Facility City,Facility State,Facility Zip,...,POC Dates,All Visit Dates,Inspection Visit Dates,Inspect TypeA,Inspect TypeB,Other Visit Dates,Other TypeA,Other TypeB,"Complaint Info- Date, #Sub Aleg, # Inc Aleg, # Uns Aleg, # TypeA, # TypeB ...",geometry
0,DAY CARE CENTER,197416900,107TH. STREET ELEMENTARY SCHOOL CSPP/HEAD START,LAUSD EARLY CHILDHOOD EDUCATON,"RIOS, REUBEN M.",(323) 756-8137,146 EAST 107TH. STREET ROOM K4,LOS ANGELES,CA,90003,...,,"03/26/2019, 03/26/2019, 05/06/2016","03/26/2019, 05/06/2016",0,0,03/26/2019,0,0,No Complaints,
1,DAY CARE CENTER,191607790,10TH STREET PRESCHOOL,"WASSON, CINDY LEE","WASSON, CINDY",(310) 458-4088,1444 10TH STREET,SANTA MONICA,CA,90401,...,08/29/2019,"07/29/2019, 01/25/2018, 03/11/2016, 03/11/2016","07/29/2019, 01/25/2018",0,1,,0,0,03/14/2016,
2,DAY CARE CENTER,197416698,186TH STREET ELEMENTARY SCHOOL CSPP,LAUSD/EARLY CHILDHOOD EDUCATION,"REED, MARCIA S.",(310) 324-1153,1581 WEST 186TH STREET ROOM 9,GARDENA,CA,90248,...,,12/13/2016,12/13/2016,0,0,,0,0,No Complaints,
3,DAY CARE CENTER,197493820,1ST CLASS PREPARATORY PRESCHOOL,1ST CLASS PREPARATORY,MARRISCHIA DAVIS,(310) 925-6394,3459 MC MANUS,CULVER CITY,CA,90232,...,,10/11/2018,,0,0,10/11/2018,0,0,No Complaints,
4,DAY CARE CENTER,384001195,1ST PLACE 2 START,1ST PLACE 2 START,SANDRA DAVIS,(415) 333-2659,1252 SUNNYDALE AVE,SAN FRANCISCO,CA,94134,...,"08/17/2016, 08/17/2016","09/12/2019, 10/24/2017, 08/24/2016, 08/03/2016","09/12/2019, 10/24/2017, 08/03/2016",0,2,08/24/2016,0,0,No Complaints,


Next, I look at a random row from the childcare dataset using sample().

In [5]:
childcare.sample()

Unnamed: 0,Facility Type,Facility Number,Facility Name,Licensee,Facility Administrator,Facility Telephone Number,Facility Address,Facility City,Facility State,Facility Zip,...,POC Dates,All Visit Dates,Inspection Visit Dates,Inspect TypeA,Inspect TypeB,Other Visit Dates,Other TypeA,Other TypeB,"Complaint Info- Date, #Sub Aleg, # Inc Aleg, # Uns Aleg, # TypeA, # TypeB ...",geometry
9739,DAY CARE CENTER,197412046,RAINBOW EARLY LEARNING CENTER,"SMITH, BRYAN & SMITH, TOKE",TOKE SMITH,(818) 993-0424,20819 PARTHENIA STREET,WINNETKA,CA,91306,...,,"02/12/2019, 11/28/2018, 05/10/2017, 03/28/2016...","11/28/2018, 05/10/2017, 03/22/2016",0,0,03/28/2016,0,0,02/14/2019,


Now that I know what variables are in this dataset, let's see what data types I have.

In [6]:
childcare.dtypes

Facility Type                                                                      object
Facility Number                                                                    object
Facility Name                                                                      object
Licensee                                                                           object
Facility Administrator                                                             object
Facility Telephone Number                                                          object
Facility Address                                                                   object
Facility City                                                                      object
Facility State                                                                     object
Facility Zip                                                                       object
County Name                                                                        object
Regional O

There is another command that gives this information, along with the non-null counts: the info() method.

In [7]:
childcare.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 19005 entries, 0 to 19004
Data columns (total 32 columns):
 #   Column                                                                         Non-Null Count  Dtype   
---  ------                                                                         --------------  -----   
 0   Facility Type                                                                  19005 non-null  object  
 1   Facility Number                                                                19005 non-null  object  
 2   Facility Name                                                                  19005 non-null  object  
 3   Licensee                                                                       19005 non-null  object  
 4   Facility Administrator                                                         19005 non-null  object  
 5   Facility Telephone Number                                                      19005 non-null  object  
 6   Facili
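
A small side sketch of why it can be useful that identifiers such as Facility Number are stored as text. It uses plain pandas (not geopandas) on two rows copied from the outputs in this notebook, read from an in-memory string rather than the real file; `dtype=str` is an option of `pandas.read_csv`, not something the original cell uses.

```python
import io
import pandas as pd

# Two rows copied from the notebook outputs, held in memory as CSV text
csv_text = (
    "Facility Number,Facility Zip\n"
    "070211880,94553\n"
    "197416900,90003\n"
)

# Default parsing infers integers, which drops the leading zero
as_numbers = pd.read_csv(io.StringIO(csv_text))

# dtype=str keeps every column as text, so '070211880' survives intact
as_text = pd.read_csv(io.StringIO(csv_text), dtype=str)

print(as_numbers['Facility Number'].tolist())
print(as_text['Facility Number'].tolist())
```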

So, most of our data is of type object, which here means strings. That's a pretty big dataset. But how big? How many rows and columns are there?

In [8]:
childcare.shape

(19005, 32)

We now know that this dataset contains 32 columns and 19,005 rows. That's too many columns. I definitely do not need information on inspection visits, complaint visits, and all of that. Let's trim our data.

In [9]:
# list of desired column names
desired_columns = ['Facility Type','Facility Number','Facility Name','Licensee', 'Facility City','Facility Address', 'Facility City', 'Facility State', 'Facility Zip', 'County Name']

# subset based on desired columns
childcare[desired_columns]

Unnamed: 0,Facility Type,Facility Number,Facility Name,Licensee,Facility City,Facility Address,Facility City.1,Facility State,Facility Zip,County Name
0,DAY CARE CENTER,197416900,107TH. STREET ELEMENTARY SCHOOL CSPP/HEAD START,LAUSD EARLY CHILDHOOD EDUCATON,LOS ANGELES,146 EAST 107TH. STREET ROOM K4,LOS ANGELES,CA,90003,LOS ANGELES
1,DAY CARE CENTER,191607790,10TH STREET PRESCHOOL,"WASSON, CINDY LEE",SANTA MONICA,1444 10TH STREET,SANTA MONICA,CA,90401,LOS ANGELES
2,DAY CARE CENTER,197416698,186TH STREET ELEMENTARY SCHOOL CSPP,LAUSD/EARLY CHILDHOOD EDUCATION,GARDENA,1581 WEST 186TH STREET ROOM 9,GARDENA,CA,90248,LOS ANGELES
3,DAY CARE CENTER,197493820,1ST CLASS PREPARATORY PRESCHOOL,1ST CLASS PREPARATORY,CULVER CITY,3459 MC MANUS,CULVER CITY,CA,90232,LOS ANGELES
4,DAY CARE CENTER,384001195,1ST PLACE 2 START,1ST PLACE 2 START,SAN FRANCISCO,1252 SUNNYDALE AVE,SAN FRANCISCO,CA,94134,SAN FRANCISCO
...,...,...,...,...,...,...,...,...,...,...
19000,SCHOOL AGE DAY CARE CENTER,070211880,YWCA OF CONTRA COSTA - HIDDEN VALLEY,YWCA OF CONTRA COSTA COUNTY,MARTINEZ,510 GLACIER,MARTINEZ,CA,94553,CONTRA COSTA
19001,SCHOOL AGE DAY CARE CENTER,198003136,Y.M.C.A GLB LOS ALTOS,Y.M.C.A. GLB LOS ALTOS,LONG BEACH,1720 BELLFLOWER,LONG BEACH,CA,90815,LOS ANGELES
19002,SCHOOL AGE DAY CARE CENTER,376700498,ZAMORANO KLASSIC KIDS,HARMONIUM INC.,SAN DIEGO,2655 CASEY STREET,SAN DIEGO,CA,92139,SAN DIEGO
19003,SCHOOL AGE DAY CARE CENTER,304370618,ZIGGURAT CHILD DEVELOPMENT CENTER,CHILDREN'S CREATIVE LEARNING CENTERS LLC,LAGUNA NIGUEL,24000 AVILA ROAD,LAGUNA NIGUEL,CA,92677,ORANGE
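
The repeated city column in this output (shown as Facility City.1) appears to come from 'Facility City' being listed twice in desired_columns. A minimal sketch of that behavior on a tiny stand-in table with rows copied from the output above; the names `wanted` and `unique_wanted` are made up for illustration.

```python
import pandas as pd

# Tiny stand-in for the childcare table
df = pd.DataFrame({
    'Facility Name': ['10TH STREET PRESCHOOL', '1ST PLACE 2 START'],
    'Facility City': ['SANTA MONICA', 'SAN FRANCISCO'],
    'County Name': ['LOS ANGELES', 'SAN FRANCISCO'],
})

# A column list with 'Facility City' repeated, as in desired_columns
wanted = ['Facility Name', 'Facility City', 'Facility City', 'County Name']
with_repeat = df[wanted]

# dict.fromkeys drops repeats while keeping the original order
unique_wanted = list(dict.fromkeys(wanted))
without_repeat = df[unique_wanted]

print(list(with_repeat.columns))
print(list(without_repeat.columns))
```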


These outputs are temporary, so I need to save the trimmed data. Let's create a new variable for it.

In [10]:
childcaretrim=childcare[desired_columns]
childcaretrim

Unnamed: 0,Facility Type,Facility Number,Facility Name,Licensee,Facility City,Facility Address,Facility City.1,Facility State,Facility Zip,County Name
0,DAY CARE CENTER,197416900,107TH. STREET ELEMENTARY SCHOOL CSPP/HEAD START,LAUSD EARLY CHILDHOOD EDUCATON,LOS ANGELES,146 EAST 107TH. STREET ROOM K4,LOS ANGELES,CA,90003,LOS ANGELES
1,DAY CARE CENTER,191607790,10TH STREET PRESCHOOL,"WASSON, CINDY LEE",SANTA MONICA,1444 10TH STREET,SANTA MONICA,CA,90401,LOS ANGELES
2,DAY CARE CENTER,197416698,186TH STREET ELEMENTARY SCHOOL CSPP,LAUSD/EARLY CHILDHOOD EDUCATION,GARDENA,1581 WEST 186TH STREET ROOM 9,GARDENA,CA,90248,LOS ANGELES
3,DAY CARE CENTER,197493820,1ST CLASS PREPARATORY PRESCHOOL,1ST CLASS PREPARATORY,CULVER CITY,3459 MC MANUS,CULVER CITY,CA,90232,LOS ANGELES
4,DAY CARE CENTER,384001195,1ST PLACE 2 START,1ST PLACE 2 START,SAN FRANCISCO,1252 SUNNYDALE AVE,SAN FRANCISCO,CA,94134,SAN FRANCISCO
...,...,...,...,...,...,...,...,...,...,...
19000,SCHOOL AGE DAY CARE CENTER,070211880,YWCA OF CONTRA COSTA - HIDDEN VALLEY,YWCA OF CONTRA COSTA COUNTY,MARTINEZ,510 GLACIER,MARTINEZ,CA,94553,CONTRA COSTA
19001,SCHOOL AGE DAY CARE CENTER,198003136,Y.M.C.A GLB LOS ALTOS,Y.M.C.A. GLB LOS ALTOS,LONG BEACH,1720 BELLFLOWER,LONG BEACH,CA,90815,LOS ANGELES
19002,SCHOOL AGE DAY CARE CENTER,376700498,ZAMORANO KLASSIC KIDS,HARMONIUM INC.,SAN DIEGO,2655 CASEY STREET,SAN DIEGO,CA,92139,SAN DIEGO
19003,SCHOOL AGE DAY CARE CENTER,304370618,ZIGGURAT CHILD DEVELOPMENT CENTER,CHILDREN'S CREATIVE LEARNING CENTERS LLC,LAGUNA NIGUEL,24000 AVILA ROAD,LAGUNA NIGUEL,CA,92677,ORANGE
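
One optional refinement when saving a subset under a new name is to add `.copy()`, so that later edits to the trimmed table clearly do not touch the original. This is an addition for illustration, not something the cell above does; the sketch uses a made-up three-column table with values taken from the outputs.

```python
import pandas as pd

# Stand-in table with one column we do not want to keep
df = pd.DataFrame({
    'Facility Name': ['10TH STREET PRESCHOOL', '1ST PLACE 2 START'],
    'Facility Zip': ['90401', '94134'],
    'Other TypeA': ['0', '0'],
})

keep = ['Facility Name', 'Facility Zip']

# .copy() makes the trimmed table independent of the original
trimmed = df[keep].copy()

# Editing the copy leaves the original untouched
trimmed['Facility Name'] = trimmed['Facility Name'].str.lower()

print(trimmed['Facility Name'].tolist())
print(df['Facility Name'].tolist())
```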


Now that my data is clean, let's do some queries! Let's start with facility city.

In [14]:
childcaretrim.loc[childcare['Facility City']=='Los Angeles']

Unnamed: 0,Facility Type,Facility Number,Facility Name,Licensee,Facility City,Facility Address,Facility City.1,Facility State,Facility Zip,County Name


This code did not return any facilities in Los Angeles. The size of the dataset is not the cause: the comparison is case-sensitive, and the city names are stored in upper case, so 'Los Angeles' matches nothing. Let's try another way to query data.
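
A minimal sketch of this kind of city filter on a small stand-in table (rows copied from the output above), comparing the mixed-case spelling 'Los Angeles' with the upper-case spelling that appears in the data. The variable names are made up for illustration.

```python
import pandas as pd

# Small stand-in table; city names are upper case, as in the real data
df = pd.DataFrame({
    'Facility Name': [
        '107TH. STREET ELEMENTARY SCHOOL CSPP/HEAD START',
        '10TH STREET PRESCHOOL',
        '1ST PLACE 2 START',
    ],
    'Facility City': ['LOS ANGELES', 'SANTA MONICA', 'SAN FRANCISCO'],
})

# String comparison is case-sensitive, so this matches nothing
mixed_case = df.loc[df['Facility City'] == 'Los Angeles']

# Matching the stored spelling returns the Los Angeles row
upper_case = df.loc[df['Facility City'] == 'LOS ANGELES']

# Upper-casing both sides makes the filter tolerant of either spelling
either_case = df.loc[df['Facility City'].str.upper() == 'Los Angeles'.upper()]

print(len(mixed_case), len(upper_case), len(either_case))
```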

In [29]:
childcaretrim[childcare.County Name == 'Los Angeles']

SyntaxError: invalid syntax (<ipython-input-29-4e0f1e39908d>, line 1)
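
The syntax error here most likely comes from the space in the column name: dot notation only works when a column name is a valid Python identifier, and `County Name` is not. A hedged sketch of the same idea with square brackets, on a small stand-in table with rows copied from the outputs above (note the upper-case 'LOS ANGELES', matching how the counties are stored):

```python
import pandas as pd

# Small stand-in table with a column name that contains a space
df = pd.DataFrame({
    'Facility Name': [
        '10TH STREET PRESCHOOL',
        '1ST PLACE 2 START',
        'Y.M.C.A GLB LOS ALTOS',
    ],
    'County Name': ['LOS ANGELES', 'SAN FRANCISCO', 'LOS ANGELES'],
})

# Square brackets with a quoted name work for any column label
in_la_county = df[df['County Name'] == 'LOS ANGELES']

print(in_la_county['Facility Name'].tolist())
```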

Maybe let's try querying by zip code to see if that works.

In [21]:
childcaretrim.query("Facility Zip == '94134'")

SyntaxError: invalid syntax (<unknown>, line 1)
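
A likely cause of this error is again the space in the column name: the expression passed to `query()` is parsed much like Python code, so `Facility Zip` reads as two separate names. pandas lets you wrap such column names in backticks inside the query string. A minimal sketch on a stand-in table with rows copied from the outputs above:

```python
import pandas as pd

# Small stand-in table; zip codes are strings, as in the real data
df = pd.DataFrame({
    'Facility Name': [
        '10TH STREET PRESCHOOL',
        '1ST PLACE 2 START',
        'Y.M.C.A GLB LOS ALTOS',
    ],
    'Facility Zip': ['90401', '94134', '90815'],
    'County Name': ['LOS ANGELES', 'SAN FRANCISCO', 'LOS ANGELES'],
})

# Backticks tell query() that the two words form one column name
by_zip = df.query("`Facility Zip` == '94134'")
by_county = df.query("`County Name` == 'LOS ANGELES'")

print(by_zip['Facility Name'].tolist())
print(by_county['Facility Name'].tolist())
```

The same backtick pattern should apply to any other column in this dataset whose name contains a space.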

That query also did not work. Let's try something different in the meantime. Let's describe our trimmed data. 

In [22]:
childcaretrim.describe()

Unnamed: 0,Facility Type,Facility Number,Facility Name,Licensee,Facility City,Facility Address,Facility City.1,Facility State,Facility Zip,County Name
count,19005,19005,19005,19005,19005,19005,19005,19005,19005,19005
unique,4,19005,14977,9776,989,14695,989,1,1420,58
top,DAY CARE CENTER,197418903,KINDERCARE LEARNING CENTER,CATALYST FAMILY INC.,LOS ANGELES,1740 PRAIRIE CITY ROAD,LOS ANGELES,CA,95630,LOS ANGELES
freq,13105,1,272,233,1030,12,1030,19005,69,4481


Looks like we have 4 facility types and 19,005 facilities in this dataset, run by 9,776 unique licensees. We have data from 58 unique counties, all within California, and 989 unique cities. The dataset covers 1,420 unique zip codes.
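
The unique, top, and freq rows of describe() can also be pulled out one at a time. A small sketch, assuming a made-up three-row table in the same shape as the trimmed data, using `nunique()` for the number of distinct values per column and `value_counts()` for how often each value occurs:

```python
import pandas as pd

# Made-up three-row table in the same shape as the trimmed data
df = pd.DataFrame({
    'Facility Type': [
        'DAY CARE CENTER',
        'DAY CARE CENTER',
        'SCHOOL AGE DAY CARE CENTER',
    ],
    'County Name': ['LOS ANGELES', 'SAN FRANCISCO', 'LOS ANGELES'],
})

# Number of distinct values in each column (the "unique" row)
unique_counts = df.nunique()

# How often each facility type occurs (the basis for "top" and "freq")
type_counts = df['Facility Type'].value_counts()

print(unique_counts)
print(type_counts)
```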

Moving forward, suppose I want to only look at childcare facilities in Los Angeles County and examine that selected dataset. How can I select that subset? Let's try

In [26]:
childcaretrim.query("County Name=='LOS ANGELES'")

SyntaxError: invalid syntax (<unknown>, line 1)

That's an invalid syntax error once again. Ask how to query data in class or office hours next week.