# City Schools Analysis

**By:** Tania Barrera (*tsbarr*)

This Jupyter Notebook walks through my analysis of city schools data for Challenge 4 of the UofT SCS EdX Data Bootcamp, using the Python module `pandas`.

It summarizes the district as a whole and each school individually, lists the highest- and lowest-performing schools, and breaks down math and reading scores by grade, school spending, school size and school type.

## Initial Setup

### Imports

The first step before performing any analysis is importing the necessary modules.

The imports I am using for this project are:

- Module **`pandas`**: to perform dataframe analysis
- Module **`numpy`**: to build the boolean helper columns with `np.where`
- Class **`Path`** from the **`pathlib`** module: to create the file path objects that are used to read in data.


In [1]:
# Import modules
import pandas as pd
import numpy as np
from pathlib import Path

### Input data

There are two datasets that are imported for this project: School Data and Student Data.

The School Data has the columns:

- School ID: unique id number as an integer, starting from 0.
- school_name: the name of this school
- type: can be District or Charter
- size: number of students in this school
- budget: budget of this school

And the Student Data has the columns:

- Student ID: unique id number as an integer, starting from 0.
- student_name: name of this student
- gender: F for female or M for male
- grade: an ordinal, one of 9th, 10th, 11th or 12th
- school_name: what school the student is in, should correspond to one of the values in column school_name of the school dataset
- reading_score: an integer up to 100
- math_score: an integer up to 100

I left-joined the school data with the student data on school_name to create the dataframe `all_data`.

In [2]:
# Input file paths
school_input_path = Path("Resources/schools_complete.csv")
student_input_path = Path("Resources/students_complete.csv")

# Read School and Student Data and store into Pandas DataFrames
school_data = pd.read_csv(school_input_path)
student_data = pd.read_csv(student_input_path)

# Combine the data into a single dataset with a left join on school_name,
# so each school row is repeated once per student enrolled in that school.
# from guide: https://pandas.pydata.org/pandas-docs/version/0.24.0/user_guide/merging.html
all_data = school_data\
    .merge(student_data, how='left', on='school_name')

# visualize first rows of combined data set
all_data.head()


Unnamed: 0,School ID,school_name,type,size,budget,Student ID,student_name,gender,grade,reading_score,math_score
0,0,Huang High School,District,2917,1910635,0,Paul Bradley,M,9th,66,79
1,0,Huang High School,District,2917,1910635,1,Victor Smith,M,12th,94,61
2,0,Huang High School,District,2917,1910635,2,Kevin Rodriguez,M,12th,90,60
3,0,Huang High School,District,2917,1910635,3,Dr. Richard Scott,M,12th,67,58
4,0,Huang High School,District,2917,1910635,4,Bonnie Ray,F,9th,97,84
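The left join on school_name can be sanity-checked on a small example. The sketch below uses made-up toy frames (the names `schools`, `students` and their values are hypothetical, not the project's CSV data) and the `validate` and `indicator` options of `DataFrame.merge` to confirm that each school name is unique on the left side and to spot schools without any matching students.

```python
import pandas as pd

# Hypothetical toy data standing in for the school and student datasets
schools = pd.DataFrame({
    "school_name": ["A High", "B High"],
    "type": ["District", "Charter"],
})
students = pd.DataFrame({
    "student_name": ["Ann", "Bob", "Cy"],
    "school_name": ["A High", "A High", "B High"],
})

# validate="one_to_many" raises a MergeError if a school name repeats in `schools`;
# indicator=True adds a "_merge" column describing where each row came from
merged = schools.merge(
    students,
    how="left",
    on="school_name",
    validate="one_to_many",
    indicator=True,
)

# Schools with no students at all would appear as "left_only" rows
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), len(unmatched))
```

With a left join, the result has one row per student plus one row for any school that has no students, so the row count is a quick check that nothing was duplicated or dropped.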


### Helper columns

Several of the requested summaries need the percentage of students who passed math, reading or both, calculated over different grouping variables. To simplify those calculations, I add three boolean columns that flag whether each student passed math, passed reading, and passed both. A score of 70 or higher counts as a pass.

In [3]:
# https://sparkbyexamples.com/pandas/pandas-add-column-based-on-another-column/

all_data['passed_math'] = np.where(all_data['math_score'] >= 70, True, False)
all_data['passed_reading'] = np.where(all_data['reading_score'] >= 70, True, False)
all_data['passed_overall'] = np.where((all_data['passed_math'] & all_data['passed_reading']), True, False)

all_data.head()

Unnamed: 0,School ID,school_name,type,size,budget,Student ID,student_name,gender,grade,reading_score,math_score,passed_math,passed_reading,passed_overall
0,0,Huang High School,District,2917,1910635,0,Paul Bradley,M,9th,66,79,True,False,False
1,0,Huang High School,District,2917,1910635,1,Victor Smith,M,12th,94,61,False,True,False
2,0,Huang High School,District,2917,1910635,2,Kevin Rodriguez,M,12th,90,60,False,True,False
3,0,Huang High School,District,2917,1910635,3,Dr. Richard Scott,M,12th,67,58,False,False,False
4,0,Huang High School,District,2917,1910635,4,Bonnie Ray,F,9th,97,84,True,True,True
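As an aside, a comparison such as `scores >= 70` already returns a boolean Series, so the same flags can be built without `np.where`. The sketch below uses a hypothetical toy frame (`scores` and `PASS_MARK` are illustrative names) and also shows why boolean columns are convenient here: their mean is the fraction of rows that are `True`.

```python
import pandas as pd

# Hypothetical toy scores, not taken from the student dataset
scores = pd.DataFrame({
    "math_score": [79, 61, 70, 58],
    "reading_score": [66, 94, 70, 67],
})

PASS_MARK = 70  # same threshold as in the cell above

# A comparison yields a boolean Series directly
scores["passed_math"] = scores["math_score"] >= PASS_MARK
scores["passed_reading"] = scores["reading_score"] >= PASS_MARK
scores["passed_overall"] = scores["passed_math"] & scores["passed_reading"]

# The mean of a boolean column is the share of True values
math_rate = scores["passed_math"].mean()
overall_rate = scores["passed_overall"].mean()
print(math_rate, overall_rate)
```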


## District Summary

Here I create a high-level snapshot of the district's key metrics in a DataFrame. The three passing-rate columns are stored as fractions between 0 and 1, not as values out of 100.

In [4]:
district_summary = pd.DataFrame(
    {
        'Total number of unique schools': [school_data['School ID'].count()],
        'Total students': [student_data['Student ID'].count()],
        'Total budget' : [school_data['budget'].sum()],
        'Average math score': [student_data['math_score'].mean()],
        'Average reading score': [student_data['reading_score'].mean()],
        '% passing math': [all_data['passed_math'].sum() / student_data['Student ID'].count()],
        '% passing reading' : [all_data['passed_reading'].sum() / student_data['Student ID'].count()],
        '% passing overall' : [all_data['passed_overall'].sum() / student_data['Student ID'].count()]
    }
)
district_summary


Unnamed: 0,Total number of unique schools,Total students,Total budget,Average math score,Average reading score,% passing math,% passing reading,% passing overall
0,15,39170,24649428,78.985371,81.87784,0.749809,0.858055,0.651723
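Because the passing columns in the output above hold fractions, a display-ready version could scale them to percentages. This is only a sketch of one way to do it: the frame `summary` re-types the three values shown above by hand, and `percent_cols` and `formatted` are hypothetical names.

```python
import pandas as pd

# Passing rates copied by hand from the district summary output above
summary = pd.DataFrame({
    "% passing math": [0.749809],
    "% passing reading": [0.858055],
    "% passing overall": [0.651723],
})

# Pick the columns whose label starts with "%"
percent_cols = [col for col in summary.columns if col.startswith("%")]

# Scale to a 0-100 range and round for display, leaving the original untouched
formatted = summary.copy()
formatted[percent_cols] = (formatted[percent_cols] * 100).round(2)
print(formatted)
```

Keeping the unrounded fractions in the original frame and formatting a copy avoids losing precision in any later calculation.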


## School Summary

Of the requested fields for the school summary data frame, four are already available in the input school dataframe: school name, school type, total students and total school budget.

The average math and reading scores can be obtained by grouping the data by school_name and taking the mean of these scores. The per-student budget and the passing rates are then derived by dividing the school budget and the pass counts by the school size.

In [14]:
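# First, group all data by school. The columns type, size and budget are
# constant within a school, so using them as grouping keys carries them through.
# Then aggregate per school:
#   - average math score and average reading score
#   - number of students passing math, reading and both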

school_summary = (all_data
    .groupby(['school_name', 'type', 'size', 'budget'], as_index=False)
    .aggregate(
        average_math_score = ('math_score', 'mean'),
        average_reading_score = ('reading_score', 'mean'),
        count_pass_math = ('passed_math', 'sum'),
        count_pass_reading = ('passed_reading', 'sum'),
        count_pass_overall = ('passed_overall', 'sum')
    )
    # https://sparkbyexamples.com/pandas/pandas-add-column-based-on-another-column/
    .assign(
        per_student_budget = lambda df: df['budget'] / df['size'],
        percent_passing_math = lambda df: df['count_pass_math'] / df['size'],
        percent_passing_reading = lambda df: df['count_pass_reading'] / df['size'],
        percent_passing_overall = lambda df: df['count_pass_overall'] / df['size']
    )
    .drop(columns = ['count_pass_math', 'count_pass_reading', 'count_pass_overall'])
    .rename(
        columns = {
            'type' : 'School type',
            'size' : 'Total students',
            'budget' : 'Total school budget'
        }
    )
    .rename(
        columns = lambda x: 
        (x.replace('percent', '%')
            .replace('_', ' ')
            .capitalize()
        )
    )
)

school_summary

Unnamed: 0,School name,School type,Total students,Total school budget,Average math score,Average reading score,Per student budget,% passing math,% passing reading,% passing overall
0,Bailey High School,District,4976,3124928,77.048432,81.033963,628.0,0.666801,0.819333,0.546423
1,Cabrera High School,Charter,1858,1081356,83.061895,83.97578,582.0,0.941335,0.970398,0.913348
2,Figueroa High School,District,2949,1884411,76.711767,81.15802,639.0,0.659885,0.807392,0.532045
3,Ford High School,District,2739,1763916,77.102592,80.746258,644.0,0.683096,0.79299,0.542899
4,Griffin High School,Charter,1468,917500,83.351499,83.816757,625.0,0.933924,0.97139,0.905995
5,Hernandez High School,District,4635,3022020,77.289752,80.934412,652.0,0.66753,0.80863,0.535275
6,Holden High School,Charter,427,248087,83.803279,83.814988,581.0,0.925059,0.962529,0.892272
7,Huang High School,District,2917,1910635,76.629414,81.182722,655.0,0.656839,0.813164,0.535139
8,Johnson High School,District,4761,3094650,77.072464,80.966394,650.0,0.660576,0.812224,0.535392
9,Pena High School,Charter,962,585858,83.839917,84.044699,609.0,0.945946,0.959459,0.905405
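The group-then-derive pattern used for the school summary can be followed on a small example. This is a sketch on hypothetical toy data (the frame `toy` and its values are invented), using the same named-aggregation syntax and the same division of pass counts by school size.

```python
import pandas as pd

# Hypothetical merged data: two schools with two students each
toy = pd.DataFrame({
    "school_name": ["A High", "A High", "B High", "B High"],
    "size": [2, 2, 2, 2],
    "math_score": [80, 60, 90, 70],
    "passed_math": [True, False, True, True],
})

per_school = (
    toy
    # size is constant per school, so grouping on it keeps it in the result
    .groupby(["school_name", "size"], as_index=False)
    # named aggregation: new_column = (source_column, function)
    .agg(
        average_math_score=("math_score", "mean"),
        count_pass_math=("passed_math", "sum"),
    )
    # derive the passing rate from the pass count and the school size
    .assign(percent_passing_math=lambda df: df["count_pass_math"] / df["size"])
)
print(per_school)
```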


## Highest-Performing Schools (by % Overall Passing)

In [15]:
top_schools = school_summary.nlargest(5, '% passing overall')

top_schools

Unnamed: 0,School name,School type,Total students,Total school budget,Average math score,Average reading score,Per student budget,% passing math,% passing reading,% passing overall
1,Cabrera High School,Charter,1858,1081356,83.061895,83.97578,582.0,0.941335,0.970398,0.913348
12,Thomas High School,Charter,1635,1043130,83.418349,83.84893,638.0,0.932722,0.973089,0.90948
4,Griffin High School,Charter,1468,917500,83.351499,83.816757,625.0,0.933924,0.97139,0.905995
13,Wilson High School,Charter,2283,1319574,83.274201,83.989488,578.0,0.938677,0.965396,0.905826
9,Pena High School,Charter,962,585858,83.839917,84.044699,609.0,0.945946,0.959459,0.905405


## Lowest-Performing Schools (by % Overall Passing)

In [16]:
bottom_schools = school_summary.nsmallest(5, '% passing overall')

bottom_schools

Unnamed: 0,School name,School type,Total students,Total school budget,Average math score,Average reading score,Per student budget,% passing math,% passing reading,% passing overall
10,Rodriguez High School,District,3999,2547363,76.842711,80.744686,637.0,0.663666,0.802201,0.529882
2,Figueroa High School,District,2949,1884411,76.711767,81.15802,639.0,0.659885,0.807392,0.532045
7,Huang High School,District,2917,1910635,76.629414,81.182722,655.0,0.656839,0.813164,0.535139
5,Hernandez High School,District,4635,3022020,77.289752,80.934412,652.0,0.66753,0.80863,0.535275
8,Johnson High School,District,4761,3094650,77.072464,80.966394,650.0,0.660576,0.812224,0.535392
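For reference, `nlargest` and `nsmallest` select the same rows as sorting on the column and taking the first rows. A minimal sketch on hypothetical toy values (the frame `toy` is invented for illustration):

```python
import pandas as pd

# Hypothetical passing rates for four schools
toy = pd.DataFrame({
    "School name": ["A", "B", "C", "D"],
    "% passing overall": [0.91, 0.53, 0.90, 0.55],
})

# Two best and two worst schools by overall passing rate
top2 = toy.nlargest(2, "% passing overall")
bottom2 = toy.nsmallest(2, "% passing overall")

# The sort-based equivalent of nlargest
top2_sorted = toy.sort_values("% passing overall", ascending=False).head(2)

print(list(top2["School name"]), list(bottom2["School name"]))
```

Both approaches keep the original row index, which is why the outputs above show index values such as 12 and 13 rather than 0 to 4.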


## Math Scores by Grade


In [23]:
math_scores_by_grade = (all_data
    .groupby('grade', as_index = False)
    .aggregate(math_scores_by_grade = ('math_score', 'mean'))
    .rename(columns = lambda x: x.replace('_', ' ').capitalize())
    .sort_values('Grade')
)

# https://www.statology.org/pandas-sort-by-string/

# get the numbers from the Grade column
math_scores_by_grade['sort'] = (math_scores_by_grade['Grade']
    .str
    .extract(r'(\d+)', expand = False)
    .astype(int)
)

# sort rows based on digits in 'sort' column then drop that column
math_scores_by_grade = (math_scores_by_grade
    .sort_values('sort')
    .drop('sort', axis=1)
)

math_scores_by_grade

Unnamed: 0,Grade,Math scores by grade
3,9th,78.935659
0,10th,78.941483
1,11th,79.083548
2,12th,78.993164
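The grades need special handling because a plain string sort puts 10th before 9th. As an alternative to extracting the digits, the order could be declared once with an ordered categorical. The sketch below is not the approach used in this notebook; it runs on hypothetical toy data, and `grade_order` is an assumed name.

```python
import pandas as pd

# Explicit ordinal order for the grade labels
grade_order = ["9th", "10th", "11th", "12th"]

# Hypothetical toy scores
toy = pd.DataFrame({
    "grade": ["10th", "9th", "12th", "11th", "9th"],
    "math_score": [70, 80, 90, 60, 100],
})

# An ordered categorical sorts by the declared order, not alphabetically
toy["grade"] = pd.Categorical(toy["grade"], categories=grade_order, ordered=True)

# groupby sorts its keys, so the result follows the category order
by_grade = toy.groupby("grade", as_index=False, observed=True)["math_score"].mean()
print(by_grade)
```

With this approach no helper sort column is needed, at the cost of having to list the grade labels up front.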


## Reading Scores by Grade


In [25]:
reading_scores_by_grade = (all_data
    .groupby('grade', as_index = False)
    .aggregate(reading_scores_by_grade = ('reading_score', 'mean'))
    .rename(columns = lambda x: x.replace('_', ' ').capitalize())
    .sort_values('Grade')
)

# https://www.statology.org/pandas-sort-by-string/

# get the numbers from the Grade column
reading_scores_by_grade['sort'] = (reading_scores_by_grade['Grade']
    .str
    .extract(r'(\d+)', expand = False)
    .astype(int)
)

# sort rows based on digits in 'sort' column then drop that column
reading_scores_by_grade = (reading_scores_by_grade
    .sort_values('sort')
    .drop('sort', axis=1)
)

reading_scores_by_grade

Unnamed: 0,Grade,Reading scores by grade
3,9th,81.914358
0,10th,81.87441
1,11th,81.885714
2,12th,81.819851


## Scores by School Spending