<h1 align=center><font size = 5>Real Data Analysis with Chicago Data</font></h1>


# Introduction

Using this Python notebook you will:

1.  Understand 3 Chicago datasets  
2.  Load the 3 datasets into 3 tables in a MySQL database
3.  Execute SQL queries to answer assignment questions 


## Understand the datasets

To complete the assignment problems in this notebook you will be using three datasets that are available on the city of Chicago's Data Portal:

1.  <a href="https://data.cityofchicago.org/Health-Human-Services/Census-Data-Selected-socioeconomic-indicators-in-C/kn9c-c2s2">Socioeconomic Indicators in Chicago</a>
2.  <a href="https://data.cityofchicago.org/Education/Chicago-Public-Schools-Progress-Report-Cards-2011-/9xs2-f89t">Chicago Public Schools</a>
3.  <a href="https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2">Chicago Crime Data</a>

### 1. Socioeconomic Indicators in Chicago

This dataset contains a selection of six socioeconomic indicators of public health significance and a “hardship index,” for each Chicago community area, for the years 2008 – 2012.

For this assignment you will use a snapshot of this dataset which can be downloaded from: <a href="https://ibm.box.com/shared/static/05c3415cbfbtfnr2fx4atenb2sd361ze.csv" target="_blank">Chicago Census Data</a>

A detailed description of this dataset and the original dataset can be obtained from the Chicago Data Portal at:
[https://data.cityofchicago.org/Health-Human-Services/Census-Data-Selected-socioeconomic-indicators-in-C/kn9c-c2s2](https://data.cityofchicago.org/Health-Human-Services/Census-Data-Selected-socioeconomic-indicators-in-C/kn9c-c2s2?cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DB0201EN-SkillsNetwork-20127838&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DB0201EN-SkillsNetwork-20127838&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ)

### 2. Chicago Public Schools

This dataset shows all school level performance data used to create CPS School Report Cards for the 2011-2012 school year. This dataset is provided by the city of Chicago's Data Portal.

For this assignment you will use a snapshot of this dataset which can be downloaded from: <a href="https://ibm.box.com/shared/static/f9gjvj1gjmxxzycdhplzt01qtz0s7ew7.csv" target="_blank"> Chicago Public School </a>

A detailed description of this dataset and the original dataset can be obtained from the Chicago Data Portal at:
[https://data.cityofchicago.org/Education/Chicago-Public-Schools-Progress-Report-Cards-2011-/9xs2-f89t](https://data.cityofchicago.org/Education/Chicago-Public-Schools-Progress-Report-Cards-2011-/9xs2-f89t?cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DB0201EN-SkillsNetwork-20127838&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ)

### 3. Chicago Crime Data

This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to present, minus the most recent seven days. 

This dataset is quite large: over 1.5GB in size with over 6.5 million rows. For the purposes of this assignment we will use a much smaller sample of this dataset which can be downloaded from: <a href="https://ibm.box.com/shared/static/svflyugsr9zbqy5bmowgswqemfpm1x7f.csv" target="_blank">Chicago Crime Data</a>

A detailed description of this dataset and the original dataset can be obtained from the Chicago Data Portal at:
[https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2](https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2?cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DB0201EN-SkillsNetwork-20127838&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ)


### Download the datasets

In many cases the dataset to be analyzed is available as a .CSV (comma separated values) file, perhaps on the internet. Click on the links below to download and save the datasets (.CSV files):

1.  **CHICAGO_CENSUS_DATA:** <a href="https://ibm.box.com/shared/static/05c3415cbfbtfnr2fx4atenb2sd361ze.csv" target="_blank">Chicago Census Dataset</a>

2.  **CHICAGO_PUBLIC_SCHOOLS:** <a href="https://ibm.box.com/shared/static/f9gjvj1gjmxxzycdhplzt01qtz0s7ew7.csv" target="_blank">Chicago Public Schools</a>

3.  **CHICAGO_CRIME_DATA:** <a href="https://ibm.box.com/shared/static/svflyugsr9zbqy5bmowgswqemfpm1x7f.csv" target="_blank">Chicago Crime Data</a>

**NOTE:** Download the datasets using the links above rather than directly from the Chicago Data Portal. The versions linked here are subsets of the original datasets, and some of their column names have been modified to be more database-friendly, which makes this assignment easier to complete.


While it is easier to read the dataset into a Pandas dataframe and then `--persist` it into the database as we saw in the previous lab, doing so maps columns to default datatypes which may not be optimal for SQL querying. For example, a long textual field may map to a CLOB instead of a VARCHAR.

Therefore, **it is highly recommended to load the tables manually using the MySQL Workbench `Table Data Import Wizard`**.

##### Now open MySQL Workbench, right-click your schema and click `Table Data Import Wizard`. Choose the path to the CSV file and click Next. Select `Create new table`, enter the table name for that file (for example **CHICAGO_PUBLIC_SCHOOLS** for the schools file) and ensure `Drop table if exists` is checked. Click Next twice to import the data from the CSV file. Check the error report and make sure the import completed without errors. Repeat for each CSV file, naming the new tables as follows:

1.  **CHICAGO_CENSUS_DATA**
2.  **CHICAGO_PUBLIC_SCHOOLS**
3.  **CHICAGO_CRIME_DATA**
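
For readers who prefer a scripted load over the wizard, the idea of declaring column types up front can be sketched as follows. This is only an illustration under assumptions, not the lab's method: it uses Python's built-in `sqlite3` as a stand-in for MySQL, a two-row in-memory CSV excerpt (a subset of columns copied from the crime sample shown in Problem 2), and a hypothetical table name `CRIME_SAMPLE`. SQLite's type system differs from MySQL's, so the declared types are indicative only.

```python
import csv
import io
import sqlite3

# Two rows and a subset of columns copied from the crime sample; in practice
# this text would come from the downloaded CSV file.
CSV_TEXT = """ID,CASE_NUMBER,PRIMARY_TYPE,ARREST,YEAR
21149,HW519443,HOMICIDE,1,2013
1340847,G040244,THEFT,1,2001
"""

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL schema

# Declare the column types explicitly instead of relying on inferred defaults.
conn.execute(
    """
    CREATE TABLE CRIME_SAMPLE (
        ID           INTEGER PRIMARY KEY,
        CASE_NUMBER  VARCHAR(10),
        PRIMARY_TYPE VARCHAR(40),
        ARREST       INTEGER,
        YEAR         INTEGER
    )
    """
)

# Convert each CSV field to the Python type that matches its column.
reader = csv.DictReader(io.StringIO(CSV_TEXT))
rows = [
    (int(r["ID"]), r["CASE_NUMBER"], r["PRIMARY_TYPE"], int(r["ARREST"]), int(r["YEAR"]))
    for r in reader
]
conn.executemany("INSERT INTO CRIME_SAMPLE VALUES (?, ?, ?, ?, ?)", rows)
conn.commit()

loaded = conn.execute("SELECT COUNT(*) FROM CRIME_SAMPLE").fetchone()[0]
print(loaded)
```

Checking the row count after the load plays the same role as reading the wizard's error report.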


### Connect to the database

Let us first load the SQL extension and establish a connection with the database.


In [1]:
%load_ext sql

In [2]:
# Build the connection string for your MySQL database from credentials stored in a .env file

import os

from dotenv import load_dotenv

load_dotenv()  # loads the variables defined in .env into the environment

myuser = os.environ.get('mysql_username')      # e.g. 'root'
mypassword = os.environ.get('mysql_password')  # e.g. 'sample-password'

connection_url = 'mysql://{user}:{password}@localhost/ibm_sql_lab'.format(user=myuser, password=mypassword)

%sql {connection_url}
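
If the password contains characters such as `@` or `/`, the connection URL above can be misparsed. One possible safeguard, sketched here with the standard library's `urllib.parse.quote_plus`, is to percent-encode the credentials before formatting the URL. The helper name `build_connection_url` and the sample credentials are hypothetical and not part of the lab.

```python
from urllib.parse import quote_plus


def build_connection_url(user, password, host="localhost", database="ibm_sql_lab"):
    """Return a mysql:// URL with percent-encoded credentials (hypothetical helper)."""
    return "mysql://{user}:{password}@{host}/{database}".format(
        user=quote_plus(user),
        password=quote_plus(password),
        host=host,
        database=database,
    )


# Made-up credentials for illustration only.
url = build_connection_url("root", "p@ss/word")
print(url)
```

The resulting string could then be passed to `%sql` in place of the hand-formatted one.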

## Problems

Now write and execute SQL queries to solve the assignment problems.

### Problem 1

##### Find the total number of crimes recorded in the CHICAGO_CRIME_DATA table


In [3]:
%%sql

select count(*) as TOTAL_CRIMES_RECORDED
    from CHICAGO_CRIME_DATA;

 * mysql://root:***@localhost/ibm_sql_lab
1 rows affected.


TOTAL_CRIMES_RECORDED
533


### Problem 2

##### Retrieve the first 10 rows from the CHICAGO_CRIME_DATA table


In [4]:
%%sql

select *
    from CHICAGO_CRIME_DATA
    limit 10

 * mysql://root:***@localhost/ibm_sql_lab
10 rows affected.


ID,CASE_NUMBER,DATE,BLOCK,IUCR,PRIMARY_TYPE,DESCRIPTION,LOCATION_DESCRIPTION,ARREST,DOMESTIC,BEAT,DISTRICT,WARD,COMMUNITY_AREA_NUMBER,FBICODE,X_COORDINATE,Y_COORDINATE,YEAR,UPDATEDON,LATITUDE,LONGITUDE,LOCATION
21149,HW519443,2013-11-03 19:27:00,044XX S RICHMOND ST,110,HOMICIDE,FIRST DEGREE MURDER,HOUSE,1,1,922,9,14.0,58.0,01A,1157439,1875086,2013,2016-08-05 15:48:24,41.81299523,-87.69802859,"(41.812995227, -87.698028592)"
23469,JA359626,2017-07-23 09:25:00,015XX E 82ND ST,110,HOMICIDE,FIRST DEGREE MURDER,STREET,0,0,411,4,8.0,45.0,01A,1188090,1850923,2017,2017-07-30 15:51:44,41.74601319,-87.58637073,"(41.746013191, -87.58637073)"
1326195,G021609,2001-01-11 02:30:41,087XX S ESCANABA AV,9901,DOMESTIC VIOLENCE,DOMESTIC VIOLENCE,APARTMENT,1,1,423,4,,,08B,1196869,1847416,2001,2015-08-17 15:03:40,41.73617608,-87.55431961,"(41.73617608, -87.554319607)"
1340847,G040244,2001-01-19 18:39:03,063XX N NAGLE AV,820,THEFT,$500 AND UNDER,GROCERY FOOD STORE,1,0,1611,16,,,6,1132586,1941599,2001,2015-08-17 15:03:40,41.99598354,-87.78763989,"(41.99598354, -87.787639887)"
1353618,G056330,2001-01-27 16:20:00,078XX S SAWYER AV,460,BATTERY,SIMPLE,RESIDENCE PORCH/HALLWAY,1,1,835,8,,,08B,1156032,1852572,2001,2015-08-17 15:03:40,41.75124194,-87.70379416,"(41.751241937, -87.703794164)"
1363954,G070193,2001-02-03 03:00:00,004XX W WRIGHTWOOD AV,460,BATTERY,SIMPLE,RESIDENCE,0,0,2333,19,,,08B,1172852,1918278,2001,2015-08-17 15:03:40,41.93119046,-87.640214,"(41.93119046, -87.640214004)"
1367327,G057394,2001-01-28 07:10:00,046XX S CICERO AV,1513,PROSTITUTION,SOLICIT FOR BUSINESS,STREET,1,0,814,8,,,16,1145110,1873073,2001,2015-08-17 15:03:40,41.80771246,-87.74330304,"(41.807712461, -87.743303038)"
1414626,G134016,2001-03-01 23:00:00,055XX S NOTTINGHAM AV,1310,CRIMINAL DAMAGE,TO PROPERTY,RESIDENCE,0,0,811,8,,,14,1130022,1866716,2001,2015-08-17 15:03:40,41.7905386,-87.79878798,"(41.790538595, -87.79878798)"
1419496,G140454,2001-03-11 16:44:05,077XX S SOUTH SHORE DR,460,BATTERY,SIMPLE,APARTMENT,0,0,421,4,,,08B,1197205,1854743,2001,2015-08-17 15:03:40,41.75627357,-87.55284517,"(41.756273565, -87.552845167)"
1427912,G122095,2001-03-02 16:20:00,039XX N ASHLAND AV,1505,PROSTITUTION,CALL OPERATION,RESIDENCE,1,0,1923,19,,,16,1164982,1926580,2001,2015-08-17 15:03:40,41.95414251,-87.66889815,"(41.954142513, -87.668898147)"


### Problem 3

##### How many crimes involve an arrest?


In [5]:
%%sql

select count(*) as CRIMES_WITH_AN_ARREST
    from CHICAGO_CRIME_DATA
     where ARREST = TRUE

 * mysql://root:***@localhost/ibm_sql_lab
1 rows affected.


CRIMES_WITH_AN_ARREST
163
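
The counting pattern from Problems 1 and 3 can be reproduced on a tiny table. The sketch below is an illustration under assumptions: it uses Python's built-in `sqlite3` rather than MySQL, and a hypothetical table `CRIME_SAMPLE` holding only the `ID` and `ARREST` values of the ten rows displayed in Problem 2. In MySQL `TRUE` is an alias for `1`, so the filter is written as `ARREST = 1` here.

```python
import sqlite3

# (ID, ARREST) pairs copied from the ten rows displayed in Problem 2.
ROWS = [
    (21149, 1), (23469, 0), (1326195, 1), (1340847, 1), (1353618, 1),
    (1363954, 0), (1367327, 1), (1414626, 0), (1419496, 0), (1427912, 1),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CRIME_SAMPLE (ID INTEGER PRIMARY KEY, ARREST INTEGER)")
conn.executemany("INSERT INTO CRIME_SAMPLE VALUES (?, ?)", ROWS)

# COUNT(*) counts every row; the WHERE clause narrows the rows being counted.
total = conn.execute("SELECT COUNT(*) FROM CRIME_SAMPLE").fetchone()[0]
with_arrest = conn.execute(
    "SELECT COUNT(*) FROM CRIME_SAMPLE WHERE ARREST = 1"
).fetchone()[0]

print(total, with_arrest)  # counts for this ten-row excerpt, not the full table
```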


### Problem 4

##### Which unique types of crimes have been recorded at GAS STATION locations?


In [6]:
%%sql

select distinct PRIMARY_TYPE, LOCATION_DESCRIPTION
    from CHICAGO_CRIME_DATA
    where LOCATION_DESCRIPTION LIKE ('%GAS STATION%')

 * mysql://root:***@localhost/ibm_sql_lab
4 rows affected.


PRIMARY_TYPE,LOCATION_DESCRIPTION
ROBBERY,GAS STATION
THEFT,GAS STATION
CRIMINAL TRESPASS,GAS STATION
NARCOTICS,GAS STATION
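
`DISTINCT` is a modifier on the whole select list rather than a function of one column, so duplicates are removed per (type, location) pair. A small sketch of that behaviour, assuming Python's built-in `sqlite3` as a stand-in for MySQL and a hypothetical table `CRIME_SAMPLE` filled with made-up rows:

```python
import sqlite3

# Made-up rows for illustration; they are not taken from the lab data.
ROWS = [
    ("THEFT", "GAS STATION"),
    ("THEFT", "GAS STATION"),              # exact duplicate, collapsed by DISTINCT
    ("ROBBERY", "GAS STATION"),
    ("THEFT", "GAS STATION DRIVE/PROP."),  # same type, different location text
    ("BATTERY", "STREET"),                 # filtered out by the LIKE pattern
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CRIME_SAMPLE (PRIMARY_TYPE TEXT, LOCATION_DESCRIPTION TEXT)")
conn.executemany("INSERT INTO CRIME_SAMPLE VALUES (?, ?)", ROWS)

pattern = ("%GAS STATION%",)

# DISTINCT over two columns keeps one row per (type, location) pair.
pairs = conn.execute(
    "SELECT DISTINCT PRIMARY_TYPE, LOCATION_DESCRIPTION FROM CRIME_SAMPLE "
    "WHERE LOCATION_DESCRIPTION LIKE ? ORDER BY 1, 2",
    pattern,
).fetchall()

# Selecting only the type column yields each crime type once.
types = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT PRIMARY_TYPE FROM CRIME_SAMPLE "
        "WHERE LOCATION_DESCRIPTION LIKE ? ORDER BY 1",
        pattern,
    )
]

print(pairs)
print(types)
```

If the location text varied among matching rows, selecting only `PRIMARY_TYPE` would be the way to list each crime type exactly once.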


### Problem 5

##### In the CHICAGO_CENSUS_DATA table, list all Community Areas whose names start with the letter ‘B’.


In [7]:
%%sql

select COMMUNITY_AREA_NAME
    from CHICAGO_CENSUS_DATA
    where COMMUNITY_AREA_NAME like ('B%')

 * mysql://root:***@localhost/ibm_sql_lab
5 rows affected.


COMMUNITY_AREA_NAME
Belmont Cragin
Burnside
Brighton Park
Bridgeport
Beverly


### Problem 6

##### Which schools in Community Areas 10 to 15 are healthy school certified?


In [8]:
%%sql

select NAME_OF_SCHOOL,COMMUNITY_AREA_NUMBER, COMMUNITY_AREA_NAME 
    from CHICAGO_PUBLIC_SCHOOLS
    where COMMUNITY_AREA_NUMBER between 10 and 15
        and HEALTHY_SCHOOL_CERTIFIED = TRUE

 * mysql://root:***@localhost/ibm_sql_lab
1 rows affected.


NAME_OF_SCHOOL,COMMUNITY_AREA_NUMBER,COMMUNITY_AREA_NAME
Rufus M Hitch Elementary School,10,NORWOOD PARK


### Problem 7

##### What is the average school Safety Score?


In [9]:
%%sql

select avg(SAFETY_SCORE) as AVG_SAFETY_SCORE
    from CHICAGO_PUBLIC_SCHOOLS

 * mysql://root:***@localhost/ibm_sql_lab
1 rows affected.


AVG_SAFETY_SCORE
49.5049
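
`AVG` skips `NULL` values, so any school without a recorded Safety Score would be left out of the average rather than counted as zero. A minimal sketch with made-up scores, assuming Python's built-in `sqlite3` as a stand-in for MySQL and a hypothetical table `SCHOOL_SAMPLE`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SCHOOL_SAMPLE (NAME TEXT, SAFETY_SCORE INTEGER)")

# Made-up rows; School C has no score (NULL).
conn.executemany(
    "INSERT INTO SCHOOL_SAMPLE VALUES (?, ?)",
    [("School A", 40), ("School B", 60), ("School C", None)],
)

# COUNT(column) counts non-NULL values only, while COUNT(*) counts all rows.
avg_score, scored, total = conn.execute(
    "SELECT AVG(SAFETY_SCORE), COUNT(SAFETY_SCORE), COUNT(*) FROM SCHOOL_SAMPLE"
).fetchone()

print(avg_score, scored, total)
```

Comparing `COUNT(SAFETY_SCORE)` with `COUNT(*)` is a quick way to see how many rows contributed to the average.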


### Problem 8

##### List the top 5 Community Areas by average College Enrollment [number of students]


In [10]:
%%sql

select COMMUNITY_AREA_NAME, avg(COLLEGE_ENROLLMENT) as AVG_COLLEGE_ENROLLMENT 
    from CHICAGO_PUBLIC_SCHOOLS
    group by COMMUNITY_AREA_NAME
    order by AVG_COLLEGE_ENROLLMENT desc
    limit 5

 * mysql://root:***@localhost/ibm_sql_lab
5 rows affected.


COMMUNITY_AREA_NAME,AVG_COLLEGE_ENROLLMENT
ARCHER HEIGHTS,2411.5
MONTCLARE,1317.0
WEST ELSDON,1233.3333
BRIGHTON PARK,1205.875
BELMONT CRAGIN,1198.8333
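
The query groups schools by community area, averages within each group, sorts the averages in descending order and keeps the first rows. The same steps on a toy table, assuming Python's built-in `sqlite3` as a stand-in for MySQL and made-up enrollment figures (the table name `SCHOOL_SAMPLE` and the area names are hypothetical, and `LIMIT 2` replaces `LIMIT 5` to keep the example small):

```python
import sqlite3

# Made-up (community area, enrollment) rows for illustration only.
ROWS = [
    ("AREA X", 1000),
    ("AREA X", 2000),
    ("AREA Y", 500),
    ("AREA Z", 1800),
    ("AREA Z", 1600),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SCHOOL_SAMPLE (COMMUNITY_AREA_NAME TEXT, COLLEGE_ENROLLMENT INTEGER)")
conn.executemany("INSERT INTO SCHOOL_SAMPLE VALUES (?, ?)", ROWS)

# Group, aggregate, sort by the aggregate, then keep the top rows.
top = conn.execute(
    """
    SELECT COMMUNITY_AREA_NAME, AVG(COLLEGE_ENROLLMENT) AS AVG_COLLEGE_ENROLLMENT
        FROM SCHOOL_SAMPLE
        GROUP BY COMMUNITY_AREA_NAME
        ORDER BY AVG_COLLEGE_ENROLLMENT DESC
        LIMIT 2
    """
).fetchall()

print(top)
```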


### Problem 9

##### Use a sub-query to determine which Community Area has the lowest school Safety Score


In [11]:
%%sql

select COMMUNITY_AREA_NAME, SAFETY_SCORE
    from CHICAGO_PUBLIC_SCHOOLS
    where SAFETY_SCORE = (select min(SAFETY_SCORE)
                              from CHICAGO_PUBLIC_SCHOOLS)

 * mysql://root:***@localhost/ibm_sql_lab
1 rows affected.


COMMUNITY_AREA_NAME,SAFETY_SCORE
WASHINGTON PARK,1
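
Because the outer query compares every row against the minimum returned by the sub-query, it returns one row per school holding that score, so a tie would yield several rows. A sketch of the pattern, assuming Python's built-in `sqlite3` as a stand-in for MySQL and a hypothetical table `SCHOOL_SAMPLE` with made-up rows that deliberately include a tie:

```python
import sqlite3

# Made-up rows; AREA X and AREA Z share the lowest score on purpose.
ROWS = [
    ("AREA X", 12),
    ("AREA Y", 35),
    ("AREA Z", 12),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SCHOOL_SAMPLE (COMMUNITY_AREA_NAME TEXT, SAFETY_SCORE INTEGER)")
conn.executemany("INSERT INTO SCHOOL_SAMPLE VALUES (?, ?)", ROWS)

# The inner SELECT yields a single value; the outer WHERE keeps every row that matches it.
lowest = conn.execute(
    """
    SELECT COMMUNITY_AREA_NAME, SAFETY_SCORE
        FROM SCHOOL_SAMPLE
        WHERE SAFETY_SCORE = (SELECT MIN(SAFETY_SCORE) FROM SCHOOL_SAMPLE)
        ORDER BY COMMUNITY_AREA_NAME
    """
).fetchall()

print(lowest)
```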


### Problem 10

##### [Without using an explicit JOIN operator] Find the Per Capita Income of the Community Area which has a school Safety Score of 1.


In [12]:
%%sql

select CS.COMMUNITY_AREA_NAME, CC.PER_CAPITA_INCOME_
    from CHICAGO_PUBLIC_SCHOOLS as CS, CHICAGO_CENSUS_DATA as CC
    where CS.COMMUNITY_AREA_NUMBER = CC.COMMUNITY_AREA_NUMBER
        and CS.SAFETY_SCORE = 1

 * mysql://root:***@localhost/ibm_sql_lab
1 rows affected.


COMMUNITY_AREA_NAME,PER_CAPITA_INCOME_
WASHINGTON PARK,13785
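
The problem asks for a query without an explicit JOIN operator. An implicit join lists both tables in the `FROM` clause and puts the matching condition in `WHERE`; for an inner join this returns the same rows as the explicit `INNER JOIN ... ON` form. The sketch below compares the two, assuming Python's built-in `sqlite3` as a stand-in for MySQL and hypothetical tables `SCHOOL_SAMPLE` and `CENSUS_SAMPLE`. Only the WASHINGTON PARK name, its Safety Score of 1 and the income 13785 come from the output above; the community area numbers and the second area are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SCHOOL_SAMPLE "
    "(COMMUNITY_AREA_NAME TEXT, COMMUNITY_AREA_NUMBER INTEGER, SAFETY_SCORE INTEGER)"
)
conn.execute(
    "CREATE TABLE CENSUS_SAMPLE (COMMUNITY_AREA_NUMBER INTEGER, PER_CAPITA_INCOME INTEGER)"
)

# Area numbers and the AREA X rows are made up for illustration.
conn.executemany(
    "INSERT INTO SCHOOL_SAMPLE VALUES (?, ?, ?)",
    [("WASHINGTON PARK", 40, 1), ("AREA X", 99, 50)],
)
conn.executemany(
    "INSERT INTO CENSUS_SAMPLE VALUES (?, ?)",
    [(40, 13785), (99, 20000)],
)

# Implicit join: tables separated by a comma, join condition in WHERE.
implicit = conn.execute(
    """
    SELECT S.COMMUNITY_AREA_NAME, C.PER_CAPITA_INCOME
        FROM SCHOOL_SAMPLE AS S, CENSUS_SAMPLE AS C
        WHERE S.COMMUNITY_AREA_NUMBER = C.COMMUNITY_AREA_NUMBER
            AND S.SAFETY_SCORE = 1
    """
).fetchall()

# Explicit join: same condition expressed with INNER JOIN ... ON.
explicit = conn.execute(
    """
    SELECT S.COMMUNITY_AREA_NAME, C.PER_CAPITA_INCOME
        FROM SCHOOL_SAMPLE AS S
            INNER JOIN CENSUS_SAMPLE AS C
                ON S.COMMUNITY_AREA_NUMBER = C.COMMUNITY_AREA_NUMBER
        WHERE S.SAFETY_SCORE = 1
    """
).fetchall()

print(implicit, explicit)
```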


Copyright © 2020 [cognitiveclass.ai](cognitiveclass.ai?utm_source=bducopyrightlink&utm_medium=dswb&utm_campaign=bdu). This notebook and its source code are released under the terms of the [MIT License](https://bigdatauniversity.com/mit-license?cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DB0201EN-SkillsNetwork-20127838&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DB0201EN-SkillsNetwork-20127838&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ).


## Author

[Temitope Adesusi](https://www.linkedin.com/in/ttadesusi)

## Reference

[IBM Data Science](https://www.coursera.org/professional-certificates/ibm-data-science?)

[Socioeconomic Indicators in Chicago](https://github.com/ttadesusi/IBM-Data-Science-Professional-Certification/blob/master/5.%20Databases%20and%20SQL%20for%20Data%20Science/MySQL_Database-Analyzing_with_Python.ipynb)