# City Schools Analysis

**By:** Tania Barrera (*tsbarr*)

This Jupyter Notebook walks through my analysis of city schools data for Challenge 4 of the UofT SCS EdX Data Bootcamp, using the Python library `pandas`.

It summarizes the district as a whole and each school individually, highlights the highest- and lowest-performing schools, and breaks down math and reading scores by grade, school spending, school size and school type.

## Initial Setup

The first step before performing any analysis is importing the necessary modules and reading the input data.

The imports I am using for this project are:

- Module **`pandas`**: to perform dataframe analysis
- Class **`Path`** from the **`pathlib`** module: to create the file path objects that are used to read in the data.


In [1]:
# Import modules
import pandas as pd
from pathlib import Path
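
As a minimal sketch of why `Path` is convenient here, the snippet below builds a path with the `/` operator and inspects its parts. The folder and file names mirror the ones used in this notebook; nothing is read from disk, so the `exists()` result depends on where the code is run.

```python
from pathlib import Path

# Folder and file names taken from this notebook's layout
resources_dir = Path("Resources")
school_path = resources_dir / "schools_complete.csv"

# Path objects expose the pieces of the path without string slicing
print(school_path.name)    # schools_complete.csv
print(school_path.suffix)  # .csv
print(school_path.parent)  # Resources

# exists() is a cheap sanity check before handing the path to pd.read_csv
print(school_path.exists())
```

Joining with `/` avoids hard-coding a path separator, which keeps the notebook portable across operating systems.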

This project imports two datasets: School Data and Student Data.

The School Data has the columns:

- School ID: unique id number as an integer, starting from 0.
- school_name: the name of this school
- type: can be District or Charter
- size: number of students in this school
- budget: budget of this school

And the Student Data has the columns:

- Student ID: unique id number as an integer, starting from 0.
- student_name: name of this student
- gender: F for female or M for male
- grade: the student's grade as an ordinal (9th, 10th, 11th or 12th)
- school_name: what school the student is in, should correspond to one of the values in column school_name of the school dataset
- reading_score: an integer up to 100
- math_score: an integer up to 100
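
The expectations in the two lists above can be checked before any analysis. The following is a minimal sketch on made-up toy data (the school and student names are hypothetical, and the lower bound of 0 for scores is my assumption, since only the upper bound of 100 is stated).

```python
import pandas as pd

# Toy data, made up for illustration; column names follow the lists above
schools = pd.DataFrame({
    "School ID": [0, 1],
    "school_name": ["School A", "School B"],
})
students = pd.DataFrame({
    "Student ID": [0, 1, 2],
    "student_name": ["Student One", "Student Two", "Student Three"],
    # "School C" is deliberately missing from the school table
    "school_name": ["School A", "School B", "School C"],
    "reading_score": [66, 94, 90],
    "math_score": [79, 61, 60],
})

# Student rows whose school_name has no match in the school table
unmatched = students[~students["school_name"].isin(schools["school_name"])]
print(unmatched["student_name"].tolist())

# Check that every score lies in the assumed range 0 to 100
scores = students[["reading_score", "math_score"]]
scores_in_range = ((scores >= 0) & (scores <= 100)).all().all()
print(scores_in_range)
```

An empty `unmatched` frame on the real data would indicate that every student can be linked to a school.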

I left-joined the school data with the student data on `school_name` and stored the result in the dataframe `allData`.

In [2]:
# Input file paths
schoolInPath = Path("Resources/schools_complete.csv")
studentInPath = Path("Resources/students_complete.csv")

# Read School and Student Data and store into Pandas DataFrames
schoolData = pd.read_csv(schoolInPath)
# no index_col is passed, so both DataFrames keep the default integer index
studentData = pd.read_csv(studentInPath)

# Combine the data into a single dataset.
# from guide: https://pandas.pydata.org/pandas-docs/version/0.24.0/user_guide/merging.html
# a left join keeps every row of schoolData and attaches each matching student row
allData = schoolData.merge(studentData, how='left', on='school_name')

# visualize first rows of combined data set
allData.head()


| | School ID | school_name | type | size | budget | Student ID | student_name | gender | grade | reading_score | math_score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Huang High School | District | 2917 | 1910635 | 0 | Paul Bradley | M | 9th | 66 | 79 |
| 1 | 0 | Huang High School | District | 2917 | 1910635 | 1 | Victor Smith | M | 12th | 94 | 61 |
| 2 | 0 | Huang High School | District | 2917 | 1910635 | 2 | Kevin Rodriguez | M | 12th | 90 | 60 |
| 3 | 0 | Huang High School | District | 2917 | 1910635 | 3 | Dr. Richard Scott | M | 12th | 67 | 58 |
| 4 | 0 | Huang High School | District | 2917 | 1910635 | 4 | Bonnie Ray | F | 9th | 97 | 84 |
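
To illustrate how a left join behaves, here is a small sketch on made-up toy data (all names are hypothetical). It also uses two optional `merge` parameters that this notebook does not use: `validate`, which raises an error if a school name is duplicated on the left side, and `indicator`, which adds a `_merge` column showing where each row came from.

```python
import pandas as pd

# Toy data, made up for illustration
schools = pd.DataFrame({
    "school_name": ["School A", "School B"],
    "type": ["District", "Charter"],
})
students = pd.DataFrame({
    "student_name": ["Student One", "Student Two"],
    "school_name": ["School A", "School A"],
})

# Left join: every school row is kept, even if it has no students
merged = schools.merge(
    students,
    how="left",
    on="school_name",
    validate="one_to_many",  # each school_name must be unique in `schools`
    indicator=True,          # adds a `_merge` column: "both" or "left_only"
)

print(merged)

# Schools without any matching student show up as "left_only"
no_students = merged.loc[merged["_merge"] == "left_only", "school_name"].tolist()
print(no_students)
```

In this toy example "School B" keeps its row with a missing `student_name`, which is the behavior that distinguishes a left join from an inner join.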


Finally, I stored the passing grade in a variable so that it can be changed in one place instead of being hard-coded throughout the analysis.

In [3]:
# setup passing grade
passing_grade = 70

## District Summary

Here I create a high-level snapshot of the district's key metrics in a DataFrame.

In [4]:
# compute total number of students
student_total = studentData['Student ID'].count()

# compute number of students that passed (score strictly greater than passing_grade)
number_pass_math = len(studentData.query('math_score > @passing_grade'))
number_pass_reading = len(studentData.query('reading_score > @passing_grade'))
number_pass_overall = len(studentData.query('math_score > @passing_grade and reading_score > @passing_grade'))
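
The queries above use a strict `>` comparison, so a score exactly equal to `passing_grade` is not counted as a pass. The sketch below, on made-up scores, contrasts that with an inclusive `>=` threshold and shows an equivalent way to get pass rates by taking the mean of a boolean mask. Which comparison is appropriate depends on how the passing grade is defined; the inclusive variant is shown only for comparison.

```python
import pandas as pd

passing_grade = 70

# Toy scores, made up for illustration; two of them sit exactly on the threshold
scores = pd.DataFrame({
    "math_score": [70, 85, 60, 90],
    "reading_score": [71, 70, 95, 88],
})

# Same approach as the cell above: count rows returned by query
n_strict = len(scores.query("math_score > @passing_grade"))

# The mean of a boolean mask is the fraction of True values
strict_math = (scores["math_score"] > passing_grade).mean()
inclusive_math = (scores["math_score"] >= passing_grade).mean()

# Overall pass rate: both subjects must clear the threshold
strict_overall = (
    (scores["math_score"] > passing_grade)
    & (scores["reading_score"] > passing_grade)
).mean()
inclusive_overall = (
    (scores["math_score"] >= passing_grade)
    & (scores["reading_score"] >= passing_grade)
).mean()

print(n_strict, strict_math, inclusive_math)
print(strict_overall, inclusive_overall)
```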

In [5]:
districtSummary = pd.DataFrame(
    {
        'number_of_schools': schoolData['School ID'].count()
        , 'number_of_students': student_total
        , 'total_budget' : schoolData['budget'].sum()
        , 'average_math_score' : studentData['math_score'].mean()
        , 'average_reading_score' : studentData['reading_score'].mean()
        , 'percent_pass_math' : number_pass_math / student_total
        , 'percent_pass_reading' : number_pass_reading / student_total
        , 'percent_pass_overall' : number_pass_overall / student_total
    }, index=[0]
)
pd.melt(districtSummary)

| | variable | value |
|---|---|---|
| 0 | number_of_schools | 15.0 |
| 1 | number_of_students | 39170.0 |
| 2 | total_budget | 24649430.0 |
| 3 | average_math_score | 78.98537 |
| 4 | average_reading_score | 81.87784 |
| 5 | percent_pass_math | 0.7239214 |
| 6 | percent_pass_reading | 0.8297166 |
| 7 | percent_pass_overall | 0.6080163 |
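
The `percent_pass_*` values above are fractions between 0 and 1, and the counts appear as floats because `pd.melt` places all values in a single column. As one possible way to make the summary easier to read, the sketch below formats a few of the values from the table above as display strings. The format choices (two decimals, a dollar sign for the budget) are my assumptions, and the formatted copy is meant for display only, since its columns become text.

```python
import pandas as pd

# A few values copied from the summary output above
summary = pd.DataFrame(
    {
        "total_budget": [24649430.0],
        "average_math_score": [78.98537],
        "percent_pass_math": [0.7239214],
    }
)

# Work on a copy so the numeric summary stays available for calculations
formatted = summary.copy()
formatted["total_budget"] = summary["total_budget"].map("${:,.2f}".format)
formatted["average_math_score"] = summary["average_math_score"].map("{:.2f}".format)
# The % format code multiplies by 100 and appends a percent sign
formatted["percent_pass_math"] = summary["percent_pass_math"].map("{:.2%}".format)

print(formatted)
```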
