# PANDAS CAPSTONE PROJECT - SCHOOL SAFETY

![schoolsafety.png](schoolsafety.png)

* Since 1998, the New York City Police Department (NYPD) has been tasked with the collection and maintenance of crime data for incidents that occur in New York City public schools. For presentation purposes, each incident has been classified in one of three categories. These categories are:
<br>
* **Major Crimes:** This category is consistent with those regularly and publicly reported by the NYPD. It includes the most serious personal and property crimes. The property crimes are burglary, grand larceny and grand larceny auto. The crimes against persons are murder, rape, robbery and felony assault.
<br>
* **Other Crimes:** This category is composed of many crimes and incidents that range in severity. It includes reports of incidents such as arson/explosion, misdemeanor assault, criminal possession or sale of a controlled substance, sale of marijuana, criminal mischief, petit larceny, reckless endangerment, sex offenses (not including rape, which is included in the Major Crimes), and weapons possession.
<br>
* **Non-Criminal Incidents:** This category includes actions which are not classified as crimes but are nevertheless disruptive to the school environment. It includes disorderly conduct, harassment, loitering, possession of marijuana, dangerous instruments and trespass.
<br>
The NYPD and the NYC Department of Education store this crime data as annual school safety reports, which are published on https://www.data.gov/ . <br>
 __In this data analysis exercise, I concatenate the 2015-16 and 2016-17 School Safety Reports and analyse the combined data.__ <br>


### IMPORTING LIBRARIES

In [1]:
import numpy as np
import pandas as pd

### STEP 1: EXAMINING DF 

#### IMPORTING DATA

* Import the school safety data and name it as:
    * 2015_16ss: `ss1516`
    * 2016_17ss: `ss1617`
* Don't forget to set `encoding="utf-8"`, `quotechar='"'`, and `delimiter=","`

In [2]:
url1 = "C:\\Users\\talfi\\python\\dersler\\capstones\\pandas\\schoolsafetynyc\\2015_16ss.csv"
url2 = "C:\\Users\\talfi\\python\\dersler\\capstones\\pandas\\schoolsafetynyc\\2016_17ss.csv"
ss1516 = pd.read_csv(url1,encoding='utf-8', quotechar='"', delimiter=',')
ss1617 = pd.read_csv(url2,encoding='utf-8', quotechar='"', delimiter=',')
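
To see what `quotechar` does, here is a minimal sketch on an invented one-row CSV held in memory (the text in `raw` is made up and is not taken from the real files). A field wrapped in the quote character may contain the delimiter without being split:

```python
import io
import pandas as pd

# Invented CSV text: the second field contains a comma, so it is quoted.
raw = 'Location Name,Building Name\n"P.S. 001","655 PARKSIDE AVENUE, BROOKLYN"\n'

# quotechar must be a single character; the quoted field stays in one piece.
df = pd.read_csv(io.StringIO(raw), quotechar='"', delimiter=",")
print(df.shape)
print(df.loc[0, "Building Name"])
```

Without the quotes around the building name, the same line would have three fields and would no longer match the two-column header.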

* Create `ss1517` by concatenating `ss1516` and `ss1617`
* Use `shape` to figure out how many rows and columns our `ss1517` has.

In [3]:
ss1517 = pd.concat([ss1516,ss1617],axis=0,join="inner")
ss1517.shape

(4116, 20)

We have 4116 rows and 20 columns. This is a fair-sized dataset for our analysis.
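
One detail worth knowing about the concatenation: `pd.concat` keeps each frame's original row labels, so labels repeat in the combined frame. A minimal sketch with two made-up frames (the values and the `extra` column are hypothetical) shows this, together with the effect of `join="inner"` and of the optional `ignore_index=True`:

```python
import pandas as pd

# Two tiny stand-in frames; the values are invented for illustration only.
a = pd.DataFrame({"Location Name": ["A", "B"], "Major N": [0.0, 1.0]})
b = pd.DataFrame({"Location Name": ["C", "D"], "Major N": [2.0, 0.0], "extra": [1, 2]})

# join="inner" keeps only the columns present in both frames, so "extra" is dropped.
stacked = pd.concat([a, b], axis=0, join="inner")
print(stacked.index.tolist())     # the row labels of a and b repeat

# ignore_index=True builds a fresh 0..n-1 index instead.
renumbered = pd.concat([a, b], axis=0, join="inner", ignore_index=True)
print(renumbered.index.tolist())
```

Repeated labels are harmless for the steps in this notebook, but they matter as soon as rows are selected by label with `loc`.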

In [4]:
ss1517.head(3)

Unnamed: 0,Location Name,Location Code,Borough,Geographical District Code,Register,Building Name,# Schools,Schools in Building,Major N,Oth N,NoCrim N,Prop N,Vio N,ENGroupA,RangeA,AvgOfMajor N,AvgOfOth N,AvgOfNoCrim N,AvgOfProp N,AvgOfVio N
0,P.S. 001 The Bergen,K001,K,15.0,1280.0,,1.0,P.S. 001 The Bergen,0.0,1.0,0.0,1.0,0.0,7C,1251-1500,0.64,3.02,5.77,1.72,1.54
1,Parkside Preparatory Academy,K002,K,17.0,475.0,655 PARKSIDE AVENUE CONSOLIDATED LOCATION,3.0,Parkside Preparatory Academy | P.S. K141 |Exp...,,,,,,3C,251-500,,,,,
2,EXPLORE CHARTER SCHOOL(BS),K704,K,17.0,529.0,655 PARKSIDE AVENUE CONSOLIDATED LOCATION,3.0,Parkside Preparatory Academy | P.S. K141 |Exp...,,,,,,4C,501-750,,,,,


#### Explanation of the Columns is needed to understand our analysis better <br>
* __Location Name__ is the name by which the organization is known. For a learning community, it is the official title of the school. <br>
* __Location Code__ is a unique identifier for a location, which can be a school, administrative office, learning community, etc. <br>
* __Borough__ is the NYC borough the location is situated in. <br>
* __Geographical District Code__ is the school’s geographical district as defined by the NYC Department of Education. <br>
* __Register__ is the number of students on register. <br>
* __Building Name__ is the official name of the building a school is located in. <br>
* __# Schools__ is the number of schools in the building. <br>
* __Schools in Building__ is the names of the schools in the building. <br>
* __Major N__ is the number of major crimes. <br>
* __Oth N__ is the number of other crimes. <br>
* __NoCrim N__ is the number of non-criminal incidents. <br>
* __Prop N__ is the number of property crimes. <br>
* __Vio N__ is the number of violent crimes. <br>
* __ENGroupA__ is the group the building population falls under (for example `7C`). <br>
* __RangeA__ is the building population range of that group (for example `1251-1500`). <br>
* __AvgOfMajor N__ is the average of major crimes for all buildings that have the same ENGroupA/RangeA. <br>
* __AvgOfOth N__ is the average of other crimes for all buildings that have the same ENGroupA/RangeA. <br>
* __AvgOfNoCrim N__ is the average of non-criminal incidents for all buildings that have the same ENGroupA/RangeA. <br>
* __AvgOfProp N__ is the average of property crimes for all buildings that have the same ENGroupA/RangeA. <br>
* __AvgOfVio N__ is the average of violent crimes for all buildings that have the same ENGroupA/RangeA. <br>
---
Let's take a brief look at our data.

* Use `.info()` to get summary information about the `ss1517`

In [6]:
ss1517.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4116 entries, 0 to 2045
Data columns (total 20 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Location Name               4115 non-null   object 
 1   Location Code               3790 non-null   object 
 2   Borough                     4113 non-null   object 
 3   Geographical District Code  4112 non-null   float64
 4   Register                    4046 non-null   float64
 5   Building Name               2373 non-null   object 
 6   # Schools                   4115 non-null   float64
 7   Schools in Building         4115 non-null   object 
 8   Major N                     2401 non-null   float64
 9   Oth N                       2401 non-null   float64
 10  NoCrim N                    2401 non-null   float64
 11  Prop N                      2401 non-null   float64
 12  Vio N                       2401 non-null   float64
 13  ENGroupA                    4073 

Of our 20 columns, 7 have dtype object and 13 have dtype float64.
___
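
The split between text and numeric columns can also be counted instead of read off the `info()` listing. A minimal sketch on a hypothetical miniature frame (the values are invented; only the column names echo the dataset):

```python
import pandas as pd

# Hypothetical miniature of the dataset: two text columns, three numeric ones.
df = pd.DataFrame({
    "Location Name": ["A", "B"],
    "Borough": ["K", "M"],
    "Register": [1280.0, 475.0],
    "# Schools": [1.0, 3.0],
    "Major N": [0.0, None],
})

# Count how many columns share each dtype.
dtype_counts = df.dtypes.value_counts()
print(dtype_counts)

# Select column names by dtype family.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
text_cols = df.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, text_cols)
```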

Let's examine the summary statistics of our ss1517 df:

* Use `.describe()` to receive summary statistics about `ss1517`

In [5]:
ss1517.describe()

| | Geographical District Code | Register | # Schools | Major N | Oth N | NoCrim N | Prop N | Vio N | AvgOfMajor N | AvgOfOth N | AvgOfNoCrim N | AvgOfProp N | AvgOfVio N |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4112.0 | 4046.0 | 4115.0 | 2401.0 | 2401.0 | 2401.0 | 2401.0 | 2401.0 | 2400.0 | 2400.0 | 2399.0 | 2399.0 | 2400.0 |
| mean | 15.419018 | 687.297084 | 2.170595 | 0.426489 | 1.746772 | 3.489379 | 1.052895 | 0.875052 | 0.438704 | 1.836046 | 3.384894 | 1.120796 | 0.879746 |
| std | 9.221523 | 547.373869 | 1.452559 | 0.873994 | 2.84971 | 7.013974 | 1.750748 | 1.689985 | 0.347415 | 1.669645 | 4.018752 | 0.898922 | 0.761603 |
| min | 0.0 | 25.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.24 | 0.71 | 1.67 | 0.51 | 0.32 |
| 25% | 8.0 | 354.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.27 | 0.9 | 1.71 | 0.63 | 0.44 |
| 50% | 14.0 | 529.5 | 2.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.28 | 1.11 | 1.74 | 0.75 | 0.53 |
| 75% | 24.0 | 850.0 | 3.0 | 1.0 | 2.0 | 4.0 | 1.0 | 1.0 | 0.57 | 2.12 | 3.64 | 1.39 | 1.05 |
| max | 32.0 | 5682.0 | 8.0 | 8.0 | 25.0 | 88.0 | 16.0 | 13.0 | 3.14 | 13.0 | 34.86 | 8.29 | 5.29 |


### STEP 2: LOCATING & REMOVING NA VALUES

Let's check whether our df has Na values or not:
* Use `isnull().values.any()` for this purpose

In [6]:
ss1517.isnull().values.any()

True

Apparently, we have some Na values. Let's figure out how many Na values we have:

* Use `isnull().values.sum()` to see how many NA values we have in `ss1517`

In [7]:
ss1517.isnull().values.sum()

19392
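
A single total hides where the gaps are. As a minimal sketch on invented data, dropping `.values` from the same call gives one count per column, and `mean()` gives the missing share per column:

```python
import numpy as np
import pandas as pd

# Invented frame with gaps in two of its three columns.
df = pd.DataFrame({
    "Building Name": [None, "X", None],
    "Major N": [0.0, np.nan, np.nan],
    "Register": [100.0, 200.0, 300.0],
})

na_per_column = df.isnull().sum()      # one count per column
na_total = int(na_per_column.sum())    # same number as df.isnull().values.sum()
na_share = df.isnull().mean()          # fraction of missing values per column
print(na_per_column)
```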

19392 of our values are Na. Wow, that's a lot! In that case, we have 3 options: 
<br>
1) We can get rid of them with `ss1517.dropna()`. We could do this, but we would also lose a lot of useful information, because by default `.dropna()` __drops every row that contains at least one Na value__, and not every value in such a row is Na. That's why we won't go with `dropna()`.
<br>
2) We can use the `value` parameter of the `fillna()` function. With a single scalar, every Na is replaced by the same value: if we fill with an int or float, the Na values in the object columns receive that number too, and the reverse holds for a string. It seems a little messy. <br>
3) We can use the `method` parameter of the `fillna()` function. We can set `method='ffill'` to replace Na values with the last valid observation, or `method='bfill'` to replace them with the next valid observation. If we first apply `method='ffill'` and then `method='bfill'`, every Na value is replaced by a value from the same column, so each column keeps its dtype. That way we preserve our dataframe's structure, at the cost of copying values from neighbouring rows. That's why we'll go with this option. 
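
The three options can be compared side by side on a toy frame (all values invented). This sketch uses the `DataFrame.ffill()` and `DataFrame.bfill()` methods, which do the same job as `fillna(method=...)`; newer pandas releases deprecate the `method` parameter in their favour:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Building Name": ["A", None, "C"],
    "Major N": [np.nan, 2.0, np.nan],
})

# Option 1: dropna() removes rows by default; axis=1 removes columns instead.
rows_dropped = df.dropna()          # no row is complete here, so 0 rows remain
cols_dropped = df.dropna(axis=1)    # both columns contain Na, so 0 columns remain

# Option 2: a dict gives each column its own replacement value.
per_column = df.fillna({"Building Name": "UNKNOWN", "Major N": 0.0})

# Option 3: forward fill, then backward fill to cover leading gaps.
filled = df.ffill().bfill()
print(filled)
```

Note that in `filled` the single observed `Major N` value ends up in all three rows, which is exactly the neighbour-copying effect of option 3.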

* Fill the NA values or drop them with the relevant method(s). Briefly explain why you chose that particular method and not the others.

In [8]:
ss1517 = ss1517.fillna(method="ffill")

In [9]:
ss1517 = ss1517.fillna(method="bfill")

Na values check, once more..

* Recheck whether you have NA values in `ss1517` or not..

In [10]:
ss1517.isnull().values.any()

False

* Take a look at your data with `.info()` and evaluate your data within `ss1517`

Great ! We don't have any Na values. 
___

Let's check what changed after we applied `fillna()`

In [11]:
ss1517.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4116 entries, 0 to 2045
Data columns (total 20 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Location Name               4116 non-null   object 
 1   Location Code               4116 non-null   object 
 2   Borough                     4116 non-null   object 
 3   Geographical District Code  4116 non-null   float64
 4   Register                    4116 non-null   float64
 5   Building Name               4116 non-null   object 
 6   # Schools                   4116 non-null   float64
 7   Schools in Building         4116 non-null   object 
 8   Major N                     4116 non-null   float64
 9   Oth N                       4116 non-null   float64
 10  NoCrim N                    4116 non-null   float64
 11  Prop N                      4116 non-null   float64
 12  Vio N                       4116 non-null   float64
 13  ENGroupA                    4116 

Great ! Everything seems to be in order.

In [12]:
ss1517.head(3)

Unnamed: 0,Location Name,Location Code,Borough,Geographical District Code,Register,Building Name,# Schools,Schools in Building,Major N,Oth N,NoCrim N,Prop N,Vio N,ENGroupA,RangeA,AvgOfMajor N,AvgOfOth N,AvgOfNoCrim N,AvgOfProp N,AvgOfVio N
0,P.S. 001 The Bergen,K001,K,15.0,1280.0,655 PARKSIDE AVENUE CONSOLIDATED LOCATION,1.0,P.S. 001 The Bergen,0.0,1.0,0.0,1.0,0.0,7C,1251-1500,0.64,3.02,5.77,1.72,1.54
1,Parkside Preparatory Academy,K002,K,17.0,475.0,655 PARKSIDE AVENUE CONSOLIDATED LOCATION,3.0,Parkside Preparatory Academy | P.S. K141 |Exp...,0.0,1.0,0.0,1.0,0.0,3C,251-500,0.64,3.02,5.77,1.72,1.54
2,EXPLORE CHARTER SCHOOL(BS),K704,K,17.0,529.0,655 PARKSIDE AVENUE CONSOLIDATED LOCATION,3.0,Parkside Preparatory Academy | P.S. K141 |Exp...,0.0,1.0,0.0,1.0,0.0,4C,501-750,0.64,3.02,5.77,1.72,1.54


How beautiful our dataset looks without Na values..

### STEP 3: ANALYSING DATA

__In this exercise, the values of `Major N` are read as codes for the major crimes:__<br>
Burglary - 0<br>
Grand larceny - 1<br>
Grand larceny auto - 2<br>
Murder - 3<br>
Rape - 4 <br>
Robbery - 5<br>
Felony - 6<br>
Assault - 8<br>
___
Let's check which of these values actually occur in the number of major crimes, a.k.a. `Major N`.
* For that purpose, you can use the `unique()` method.

In [13]:
avgmajor = ss1517["Major N"].unique()
print(avgmajor)

[0. 3. 1. 2. 4. 6. 5. 8.]


Though we have 7 different crimes, Burglary (0) and Grand Larceny (1) are so dominant that the other crimes barely show up in `Major N`. This will become clearer once we plot our `Major N` (Number of Major Crimes) column.
___
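
`unique()` lists which values occur, but not how often. As a hedged sketch on an invented series, `value_counts()` gives the number of rows per value, which is what a statement about dominance rests on:

```python
import pandas as pd

# Invented stand-in for the "Major N" column.
major_n = pd.Series([0.0, 0.0, 1.0, 0.0, 3.0, 1.0, 0.0, 2.0], name="Major N")

counts = major_n.value_counts().sort_index()                 # rows per distinct value
shares = major_n.value_counts(normalize=True).sort_index()   # same, as fractions
print(counts)
print(shares)
```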

Let's separate these two into their own dataframe, named `bigtwo`, and dive deeper. <br>
Hint: You can use the following structure: `df[df["Column"] <= yourfilter]`

### BIG TWO

In [14]:
bigtwo = ss1517[ss1517["Major N"] <= 1]

Let's examine our big two.
* Start with `shape`, then proceed with `info()`
* You can also print the first five rows for that purpose.

In [15]:
bigtwo.shape

(3606, 20)
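
The filter in the hint works in two steps: the comparison builds a boolean Series, and indexing with that Series keeps the rows marked `True`. A minimal sketch on invented rows, including the complementary split with `~`:

```python
import pandas as pd

df = pd.DataFrame({"Location Name": list("ABCDE"),
                   "Major N": [0.0, 1.0, 3.0, 0.0, 2.0]})

mask = df["Major N"] <= 1      # boolean Series, one True/False per row
low = df[mask]                 # rows where the condition holds
high = df[~mask]               # the complement: rows where it does not

# With no Na values in the column, the two parts cover every row exactly once.
assert len(low) + len(high) == len(df)
print(low["Location Name"].tolist(), high["Location Name"].tolist())
```

The same bookkeeping holds in this notebook: the 3606 rows of `bigtwo` and the 510 rows selected later with `> 1` add up to the 4116 rows of `ss1517`.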

In [44]:
bigtwo.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3606 entries, 0 to 2045
Data columns (total 20 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Location Name               3606 non-null   object 
 1   Location Code               3606 non-null   object 
 2   Borough                     3606 non-null   object 
 3   Geographical District Code  3606 non-null   float64
 4   Register                    3606 non-null   float64
 5   Building Name               3606 non-null   object 
 6   # Schools                   3606 non-null   float64
 7   Schools in Building         3606 non-null   object 
 8   Major N                     3606 non-null   float64
 9   Oth N                       3606 non-null   float64
 10  NoCrim N                    3606 non-null   float64
 11  Prop N                      3606 non-null   float64
 12  Vio N                       3606 non-null   float64
 13  ENGroupA                    3606 

In [16]:
bigtwo.head()

Unnamed: 0,Location Name,Location Code,Borough,Geographical District Code,Register,Building Name,# Schools,Schools in Building,Major N,Oth N,NoCrim N,Prop N,Vio N,ENGroupA,RangeA,AvgOfMajor N,AvgOfOth N,AvgOfNoCrim N,AvgOfProp N,AvgOfVio N
0,P.S. 001 The Bergen,K001,K,15.0,1280.0,655 PARKSIDE AVENUE CONSOLIDATED LOCATION,1.0,P.S. 001 The Bergen,0.0,1.0,0.0,1.0,0.0,7C,1251-1500,0.64,3.02,5.77,1.72,1.54
1,Parkside Preparatory Academy,K002,K,17.0,475.0,655 PARKSIDE AVENUE CONSOLIDATED LOCATION,3.0,Parkside Preparatory Academy | P.S. K141 |Exp...,0.0,1.0,0.0,1.0,0.0,3C,251-500,0.64,3.02,5.77,1.72,1.54
2,EXPLORE CHARTER SCHOOL(BS),K704,K,17.0,529.0,655 PARKSIDE AVENUE CONSOLIDATED LOCATION,3.0,Parkside Preparatory Academy | P.S. K141 |Exp...,0.0,1.0,0.0,1.0,0.0,4C,501-750,0.64,3.02,5.77,1.72,1.54
3,P.S. K141,K141,K,17.0,374.0,655 PARKSIDE AVENUE CONSOLIDATED LOCATION,3.0,Parkside Preparatory Academy | P.S. K141 |Exp...,0.0,1.0,0.0,1.0,0.0,3C,251-500,0.64,3.02,5.77,1.72,1.54
4,655 PARKSIDE AVENUE CONSOLIDATED LOCATION,K141,K,17.0,1378.0,655 PARKSIDE AVENUE CONSOLIDATED LOCATION,3.0,Parkside Preparatory Academy | P.S. K141 |Exp...,0.0,3.0,1.0,0.0,1.0,5C,751-1000,0.38,1.71,2.38,0.98,0.82


There might be a relationship between the two big crimes and the borough in which they are committed. 
___
* To figure out this relationship, we need to know our Borough values.
* Use `unique()` to return the unique values of the `Borough` column of `bigtwo`.


In [17]:
bigtwo['Borough'].unique()

array(['K', 'O', 'M', 'Q', 'R', 'X'], dtype=object)

In that case: <br>
__M__ represents Manhattan. <br>
__Q__ represents Queens. <br>
__R__ represents Staten Island (Richmond County). <br>
__K__ represents Brooklyn (Kings County). <br>
__X__ represents The Bronx. <br>
__O__ is not one of the five borough letters; it appears in the data, but its meaning is not documented here. <br>

* Now that we know the Borough codes and their actual names, we can examine the relationship between `bigtwo` and the boroughs by grouping the rows by borough. The handiest tool for that is the `groupby()` function.
* Use `groupby()` to group `bigtwo` by its `Borough`.

In [18]:
bigtwo.groupby("Borough")["Major N"].sum()

Borough
K    233.0
M    161.0
O      0.0
Q     87.0
R     18.0
X    201.0
Name: Major N, dtype: float64
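
A plain sum depends on how many rows each borough contributes. As a sketch on invented rows, `agg` can return the total, the row count and the per-row average together, sorted by the total:

```python
import pandas as pd

df = pd.DataFrame({
    "Borough": ["K", "K", "M", "X", "X", "Q"],
    "Major N": [1.0, 0.0, 1.0, 1.0, 1.0, 0.0],
})

by_borough = (
    df.groupby("Borough")["Major N"]
      .agg(["sum", "count", "mean"])        # total, number of rows, average per row
      .sort_values("sum", ascending=False)  # largest total first
)
print(by_borough)
```

Reading `sum` next to `count` makes it easier to tell a borough with many rows from a borough with a high value per row.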

* Evaluate the result **in your own words**. Remember, Data Science is all about explaining the story of the data. So, try your best and seize the story behind the data..

Well, it seems that burglary and grand larceny crimes are committed most in Brooklyn (233), then the Bronx (201), then Manhattan (161). These totals do not follow wealth in a simple way: the Bronx, the poorest borough in NYC, does record more than Manhattan, the richest, but Brooklyn tops both, and Manhattan still records almost twice as many as Queens (87).
___
Hmm, how about adding a new variable to our equation and looking from a different perspective? <br>
Let's examine our big two by borough and by the number of students in each school, a.k.a. `Register`. <br>
That way, we can evaluate `bigtwo` not only against the boroughs and their wealth, but also against the boroughs' student populations and their effect on burglary and grand larceny.
___
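
The cell for this step is not shown here. One possible way to bring `Register` in, sketched on invented numbers and not necessarily the approach the original analysis took, is to total both columns per borough and relate them per 1,000 registered students (`per_1000` is a hypothetical column name):

```python
import pandas as pd

df = pd.DataFrame({
    "Borough": ["K", "K", "M", "X"],
    "Register": [1000.0, 500.0, 2000.0, 500.0],
    "Major N": [1.0, 1.0, 1.0, 1.0],
})

# Total students and total Major N per borough.
summary = df.groupby("Borough")[["Register", "Major N"]].sum()

# Relate the two totals: Major N per 1,000 registered students.
summary["per_1000"] = summary["Major N"] / summary["Register"] * 1000
print(summary)
```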


Now, we can say that much more grand larceny than burglary is committed in Manhattan. For the other boroughs, everything is pretty much the same. 

### OTHER MAJOR CRIMES

Now, it is time to talk about the other major crimes : <br>
Grand larceny auto - 2<br>
Murder - 3<br>
Rape - 4 <br>
Robbery - 5<br>
Felony - 6<br>
Assault - 8<br>
___
Although they do not occur as frequently as burglary and grand larceny, their nature is much more serious than that of the big two. Let's create a new dataframe and name it `othercrimes`:
* Hint: You can use the following structure: `df[df["Column"] > yourfilter]`

In [19]:
othercrimes = ss1517[ss1517["Major N"] > 1]

Let's examine the other crimes: 
* Start with `shape`, then proceed with `info()`
* You can also print the first five rows for that purpose.

In [20]:
othercrimes.shape

(510, 20)

In [34]:
othercrimes.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 510 entries, 9 to 2031
Data columns (total 20 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Location Name               510 non-null    object 
 1   Location Code               510 non-null    object 
 2   Borough                     510 non-null    object 
 3   Geographical District Code  510 non-null    float64
 4   Register                    510 non-null    float64
 5   Building Name               510 non-null    object 
 6   # Schools                   510 non-null    float64
 7   Schools in Building         510 non-null    object 
 8   Major N                     510 non-null    float64
 9   Oth N                       510 non-null    float64
 10  NoCrim N                    510 non-null    float64
 11  Prop N                      510 non-null    float64
 12  Vio N                       510 non-null    float64
 13  ENGroupA                    510 no

In [21]:
othercrimes.head(3)

Unnamed: 0,Location Name,Location Code,Borough,Geographical District Code,Register,Building Name,# Schools,Schools in Building,Major N,Oth N,NoCrim N,Prop N,Vio N,ENGroupA,RangeA,AvgOfMajor N,AvgOfOth N,AvgOfNoCrim N,AvgOfProp N,AvgOfVio N
9,P.S. 008 Robert Fulton,K008,K,13.0,924.0,655 PARKSIDE AVENUE CONSOLIDATED LOCATION,1.0,P.S. 008 Robert Fulton,3.0,0.0,0.0,3.0,0.0,5C,751-1000,0.38,1.71,2.38,0.98,0.82
10,P.S. 009 Teunis G. Bergen,K009,K,13.0,838.0,80 UNDERHILL AVENUE CONSOLIDATED LOCATION,2.0,P.S. 009 Teunis G. Bergen|Brooklyn East Colleg...,3.0,0.0,0.0,3.0,0.0,5C,751-1000,0.38,1.71,2.38,0.98,0.82
11,BROOKLYN EAST COLLEGIATE CHARTER SCHOOL(BN),K780,K,13.0,388.0,80 UNDERHILL AVENUE CONSOLIDATED LOCATION,2.0,P.S. 009 Teunis G. Bergen|Brooklyn East Colleg...,3.0,0.0,0.0,3.0,0.0,3C,251-500,0.38,1.71,2.38,0.98,0.82


It seems that we won't use some of the columns in our dataset. Let's get rid of them. <br>
We can do this with the `drop()` function of pandas, which drops columns when called with `axis=1`.
___
* After careful review, your team lead thinks that `Location Code` column is unnecessary for our analysis.
* Drop the `Location Code` column from `othercrimes` df.
* After that, print the first 5 rows to check whether `Location Code` is dropped or not. 

In [22]:
othercrimes = othercrimes.drop("Location Code", axis = 1)
othercrimes.head()

Unnamed: 0,Location Name,Borough,Geographical District Code,Register,Building Name,# Schools,Schools in Building,Major N,Oth N,NoCrim N,Prop N,Vio N,ENGroupA,RangeA,AvgOfMajor N,AvgOfOth N,AvgOfNoCrim N,AvgOfProp N,AvgOfVio N
9,P.S. 008 Robert Fulton,K,13.0,924.0,655 PARKSIDE AVENUE CONSOLIDATED LOCATION,1.0,P.S. 008 Robert Fulton,3.0,0.0,0.0,3.0,0.0,5C,751-1000,0.38,1.71,2.38,0.98,0.82
10,P.S. 009 Teunis G. Bergen,K,13.0,838.0,80 UNDERHILL AVENUE CONSOLIDATED LOCATION,2.0,P.S. 009 Teunis G. Bergen|Brooklyn East Colleg...,3.0,0.0,0.0,3.0,0.0,5C,751-1000,0.38,1.71,2.38,0.98,0.82
11,BROOKLYN EAST COLLEGIATE CHARTER SCHOOL(BN),K,13.0,388.0,80 UNDERHILL AVENUE CONSOLIDATED LOCATION,2.0,P.S. 009 Teunis G. Bergen|Brooklyn East Colleg...,3.0,0.0,0.0,3.0,0.0,3C,251-500,0.38,1.71,2.38,0.98,0.82
69,272 MACDONOUGH STREET CONSOLIDATED LOCATION,K,16.0,390.0,272 MACDONOUGH STREET CONSOLIDATED LOCATION,2.0,Brooklyn Brownstone School | M.S. 035 Stephen...,2.0,0.0,1.0,1.0,1.0,3C,251-500,0.27,0.9,1.67,0.63,0.44
74,265 RALPH AVENUE CONSOLIDATED LOCATION,K,16.0,493.0,265 RALPH AVENUE CONSOLIDATED LOCATION,2.0,P.S. 040 George W. Carver | Gotham Profession...,2.0,2.0,1.0,3.0,1.0,3C,251-500,0.27,0.9,1.67,0.63,0.44
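
A practical note for notebooks: running a `drop` cell a second time raises a `KeyError`, because the column is already gone. A minimal sketch on an invented one-row frame shows the `columns=` spelling, which is equivalent to `axis=1`, and the `errors="ignore"` option that makes the cell safe to re-run:

```python
import pandas as pd

df = pd.DataFrame({"Location Name": ["A"], "Location Code": ["K001"], "Borough": ["K"]})

# columns= is equivalent to passing the label together with axis=1.
trimmed = df.drop(columns=["Location Code"])

# errors="ignore" skips labels that are not present instead of raising KeyError.
trimmed_again = trimmed.drop(columns=["Location Code"], errors="ignore")
print(trimmed_again.columns.tolist())
```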


Great! To make the analysis smoother, we'll only need 3 columns: <br>
`# Schools`: Number of schools in the building. We need this column because different schools mean different population categories, such as age and culture, and these differences might affect the major crimes in a school. <br>
`ENGroupA`: Building population group. <br>
`Major N`: Other major crimes.

To work with only these 3 columns, we need to reshape our df. We can do this with `loc` or `iloc`.
* Update `othercrimes` so that it has only the `# Schools`, `ENGroupA`, and `Major N` columns, nothing more. Use `loc` or `iloc` for this purpose.

In [23]:
othercrimes = othercrimes.iloc[:,[5,7,12]]

In [24]:
othercrimes.head()

| | # Schools | Major N | ENGroupA |
|---|---|---|---|
| 9 | 1.0 | 3.0 | 5C |
| 10 | 2.0 | 3.0 | 5C |
| 11 | 2.0 | 3.0 | 3C |
| 69 | 2.0 | 2.0 | 3C |
| 74 | 2.0 | 2.0 | 3C |
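
Selecting by position, as with `iloc`, silently picks different columns if the column order ever changes. A label-based selection with `loc` avoids that; the sketch below uses an invented frame and looks up the positions only to show that both routes agree:

```python
import pandas as pd

df = pd.DataFrame({
    "Location Name": ["A", "B"],
    "# Schools": [1.0, 2.0],
    "Schools in Building": ["A", "B | C"],
    "Major N": [3.0, 2.0],
    "ENGroupA": ["5C", "3C"],
})

wanted = ["# Schools", "Major N", "ENGroupA"]
by_label = df.loc[:, wanted]

# The positional equivalent has to be recomputed whenever the column order changes.
positions = [df.columns.get_loc(name) for name in wanted]
by_position = df.iloc[:, positions]
print(positions)
```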


* Let's sort the values by `Major N` to see the relationship better.
* Use `sort_values()` for that purpose.

In [25]:
othercrimes = othercrimes.sort_values(by = "Major N")
othercrimes.head(3)

| | # Schools | Major N | ENGroupA |
|---|---|---|---|
| 1949 | 6.0 | 2.0 | 10C |
| 258 | 2.0 | 2.0 | 4C |
| 257 | 2.0 | 2.0 | 9C |
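
When many rows share the same `Major N`, a second sort key makes the order among them predictable. A minimal sketch on invented rows, sorting by two columns with a separate direction for each:

```python
import pandas as pd

df = pd.DataFrame({
    "# Schools": [2.0, 6.0, 1.0, 2.0],
    "Major N": [2.0, 2.0, 3.0, 8.0],
    "ENGroupA": ["4C", "10C", "5C", "9C"],
})

# Largest Major N first; ties are broken by the number of schools, smallest first.
ordered = df.sort_values(by=["Major N", "# Schools"], ascending=[False, True])
print(ordered)
```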


* Great! To see the effects of building population and the number of schools in the building on other major crimes, let's pivot the data. We can do this with the `pivot_table` function of pandas.
* Use `pivot_table` as follows:
    * For index, use `ENGroupA`,
    * For columns, use `# Schools`
    * For values, use `Major N`

In [26]:
pivot_table = othercrimes.pivot_table(index=["ENGroupA"],
                                     columns="# Schools", 
                                     values='Major N')
pivot_table

| ENGroupA / # Schools | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 | 7.0 | 8.0 |
|---|---|---|---|---|---|---|---|---|
| 10C | 2.2 | 3.166667 | 4.5 | 2.25 | 2.0 | 2.428571 | 5.0 | 4.0 |
| 11C | 2.0 | 5.0 | NaN | NaN | 4.0 | 3.5 | 3.2 | NaN |
| 12C | 2.818182 | 3.0 | NaN | NaN | NaN | NaN | 3.5 | NaN |
| 13C | 2.75 | 3.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2C | 2.333333 | 2.304348 | 3.117647 | 4.5 | 3.142857 | 3.0 | 2.0 | NaN |
| 3C | 2.2 | 2.418605 | 2.72973 | 3.666667 | 2.619048 | 3.454545 | 2.263158 | NaN |
| 4C | 2.333333 | 2.470588 | 2.4 | 3.0 | 3.0 | 3.666667 | 2.5 | NaN |
| 5C | 2.333333 | 2.266667 | 2.666667 | 2.666667 | 2.0 | NaN | NaN | NaN |
| 6C | 2.666667 | 2.416667 | 2.315789 | 2.0 | NaN | NaN | 3.0 | NaN |
| 7C | 2.6 | 2.333333 | 2.8 | 3.428571 | 3.5 | 2.333333 | NaN | NaN |
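
The fractional cells come from the default `aggfunc`, which is the mean: each cell averages `Major N` over all rows with that `ENGroupA` and `# Schools` combination, and a combination with no rows stays empty. A minimal sketch on invented rows, with a second pivot that counts the rows behind each average:

```python
import pandas as pd

df = pd.DataFrame({
    "ENGroupA": ["3C", "3C", "3C", "5C"],
    "# Schools": [1.0, 1.0, 2.0, 1.0],
    "Major N": [2.0, 4.0, 3.0, 2.0],
})

# aggfunc defaults to the mean, so each cell is an average of Major N.
means = df.pivot_table(index="ENGroupA", columns="# Schools", values="Major N")

# Counting rows per cell shows how many observations stand behind each average.
sizes = df.pivot_table(index="ENGroupA", columns="# Schools",
                       values="Major N", aggfunc="count")
print(means)
print(sizes)
```

Here the `5C` row has no building with 2 schools, so that cell is Na in both tables.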


We have Na values again. Replace them as we did earlier, with the `bfill` and `ffill` options of the `method` parameter of `fillna()`.

In [27]:
pivot_table = pivot_table.fillna(method = 'bfill')

In [28]:
pivot_table = pivot_table.fillna(method = 'ffill')

In [29]:
pivot_table.isnull().values.any()

False

Print the first 10 rows of `pivot_table`

In [30]:
pivot_table.head(10)

| ENGroupA / # Schools | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 | 7.0 | 8.0 |
|---|---|---|---|---|---|---|---|---|
| 10C | 2.2 | 3.166667 | 4.5 | 2.25 | 2.0 | 2.428571 | 5.0 | 4.0 |
| 11C | 2.0 | 5.0 | 3.117647 | 4.5 | 4.0 | 3.5 | 3.2 | 4.0 |
| 12C | 2.818182 | 3.0 | 3.117647 | 4.5 | 3.142857 | 3.0 | 3.5 | 4.0 |
| 13C | 2.75 | 3.0 | 3.117647 | 4.5 | 3.142857 | 3.0 | 2.0 | 4.0 |
| 2C | 2.333333 | 2.304348 | 3.117647 | 4.5 | 3.142857 | 3.0 | 2.0 | 4.0 |
| 3C | 2.2 | 2.418605 | 2.72973 | 3.666667 | 2.619048 | 3.454545 | 2.263158 | 4.0 |
| 4C | 2.333333 | 2.470588 | 2.4 | 3.0 | 3.0 | 3.666667 | 2.5 | 4.0 |
| 5C | 2.333333 | 2.266667 | 2.666667 | 2.666667 | 2.0 | 2.333333 | 3.0 | 4.0 |
| 6C | 2.666667 | 2.416667 | 2.315789 | 2.0 | 3.5 | 2.333333 | 3.0 | 4.0 |
| 7C | 2.6 | 2.333333 | 2.8 | 3.428571 | 3.5 | 2.333333 | 3.0 | 4.0 |


Our pivot is useful if we want to look up the direct relationship, but it is not easy to read. Let's keep it simple for everyone and use a scatter plot. <br>
Let's examine the effects of the number of schools in the building and the building population on other major crimes. A scatter plot serves our purpose best here, because we are trying to show the relationship between two variables and their effect on a third variable, the other major crimes.
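
The plotting cell is not shown here, so the following is only one possible sketch, drawn on invented rows with matplotlib. It assumes every `ENGroupA` code is a number followed by `C`, as in the values shown earlier, and `group_no` is a hypothetical helper column that lets the groups sort numerically:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "# Schools": [1.0, 2.0, 3.0, 3.0],
    "Major N": [2.0, 3.0, 6.0, 2.0],
    "ENGroupA": ["3C", "5C", "10C", "3C"],
})

# Assumption: codes look like "10C"; strip the letter so 10 sorts after 3.
df["group_no"] = df["ENGroupA"].str.rstrip("C").astype(int)

# Two variables on the axes, the third one encoded as colour.
fig, ax = plt.subplots()
points = ax.scatter(df["# Schools"], df["group_no"], c=df["Major N"], cmap="viridis")
ax.set_xlabel("# Schools")
ax.set_ylabel("ENGroupA (numeric part)")
fig.colorbar(points, ax=ax, label="Major N")
```

Putting `Major N` on the colour scale keeps both explanatory variables on the axes; swapping the roles is equally valid if a different comparison is wanted.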

Well, let's see.. Major crime number 6 (felony) is very dense in the buildings with 3 schools. There is also a linear relationship between the building population and the kind of major crime in the buildings with 1 school. 