## DATA 602 Fall 2024 - Assignment 7
### Stephanie Chiang

## NYC SHSAT

For this assignment, I will examine the NYC middle schools and the numbers of students who "participated in High School Admissions, the number of those students who took the Specialized High Schools Admissions Test (SHSAT) and the number who received an offer to one of the 8 testing Specialized High Schools" for the 2020-2021 school year.

The raw file for the *2020-2021 SHSAT Admissions Test Offers By Sending School*, provided by the NYC Department of Education, is available for download [here](https://data.cityofnewyork.us/Education/2020-2021-SHSAT-Admissions-Test-Offers-By-Sending-/k8ah-28f4/about_data).


### Data Exploration

- importing the dataset and creating dataframes
- missing value information
- relevant information about the dataset
- summary statistics: means, medians, quartiles

In [129]:
import pandas as pd

data = pd.read_csv("2021shsat.csv")

# modifying multiple column names to more code-friendly format
data.rename(
  columns = {
    "Feeder School DBN": "DBN",
    "Feeder School Name": "school",
    "Count of Students in HS Admissions": "hs_bound",
    "Count of Testers": "testers",
    "Number of Offers": "offers"
  },
  inplace = True
)

print(data.isna().sum())
# there appears to be no missing data

print(data.shape)
# there are 658 rows (observations) and 5 columns:
# the middle school's District/Borough Number
# the name of the middle school
# the number of students who participated in HS admissions
# the number of students who took the SHSAT
# and the number of students who received an offer

print(data.dtypes)
# all the columns are listed as 'object', which here means pandas read the values as strings
# this is because '0-5' appears as a (very common) value in the count columns
# so pandas could not automatically convert these columns to integer types

DBN         0
school      0
hs_bound    0
testers     0
offers      0
dtype: int64
(658, 5)
DBN         object
school      object
hs_bound    object
testers     object
offers      object
dtype: object



The last 3 columns use the value "0-5" for any count of 5 students or fewer. There are a few different ways this could be handled, but for the sake of calculating summary statistics, I will convert these columns to floats, substituting the midpoint of the range, 2.5, for each "0-5" value. Since the original data contains no half-student counts, the substituted values can still be easily distinguished from the rest of the data.

In [130]:
# convert variables to proper types

data["hs_bound"] = data["hs_bound"].where(data["hs_bound"] != "0-5", "2.5")
data["testers"] = data["testers"].where(data["testers"] != "0-5", "2.5")
data["offers"] = data["offers"].where(data["offers"] != "0-5", "2.5")

data[["hs_bound", "testers", "offers"]] = data[["hs_bound", "testers", "offers"]].astype(float)

print(data.dtypes)
print(data.head(5))

DBN          object
school       object
hs_bound    float64
testers     float64
offers      float64
dtype: object
      DBN                                          school  hs_bound  testers  \
0  01M034         P.S. 034 FRANKLIN D. ROOSEVELT (01M034)      44.0      2.5   
1  01M140                 P.S. 140 NATHAN STRAUS (01M140)      56.0      9.0   
2  01M184                   P.S. 184M SHUANG WEN (01M184)     112.0     79.0   
3  01M188             P.S. 188 THE ISLAND SCHOOL (01M188)      49.0      2.5   
4  01M332  UNIVERSITY NEIGHBORHOOD MIDDLE SCHOOL (01M332)      70.0     10.0   

   offers  
0     2.5  
1     2.5  
2    29.0  
3     2.5  
4     2.5  
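
As one of the other possible ways to handle the "0-5" values, the sketch below records which cells were suppressed before substituting the midpoint, so they can be filtered later without relying on the 2.5 sentinel. It runs on a small toy DataFrame rather than the real file, and the `toy`, `count_cols`, and `suppressed` names are assumptions made for illustration.

```python
import pandas as pd

# Toy data standing in for the real file; the values are illustrative only.
toy = pd.DataFrame({
    "hs_bound": ["44", "112", "0-5"],
    "testers": ["0-5", "79", "0-5"],
    "offers": ["0-5", "29", "0-5"],
})

count_cols = ["hs_bound", "testers", "offers"]

# Record which cells were suppressed before overwriting them.
suppressed = toy[count_cols].eq("0-5")

# Swap the "0-5" marker for the midpoint, then convert every column to float.
toy[count_cols] = toy[count_cols].replace("0-5", "2.5").astype(float)

print(toy.dtypes)
print(suppressed.sum())  # number of suppressed cells per column
```

Keeping the boolean mask separate means a later filter can use `suppressed["testers"]` directly instead of comparing floats against 2.5.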



In [131]:

# summary statistics: means, medians, quartiles

mean_h = data['hs_bound'].mean()
mean_t = data['testers'].mean()
mean_o = data['offers'].mean()

print(mean_h)
print(mean_t)
print(mean_o)

median_h = data['hs_bound'].median()
median_t = data['testers'].median()
median_o = data['offers'].median()

print(median_h)
print(median_t)
print(median_o)

quartiles_h = data['hs_bound'].quantile([0.25, 0.5, 0.75])
quartiles_t = data['testers'].quantile([0.25, 0.5, 0.75])
quartiles_o = data['offers'].quantile([0.25, 0.5, 0.75])

print(quartiles_h)
print(quartiles_t)
print(quartiles_o)


110.29103343465046
33.48176291793313
7.302431610942249
77.0
16.0
2.5
0.25     48.00
0.50     77.00
0.75    119.75
Name: hs_bound, dtype: float64
0.25     7.0
0.50    16.0
0.75    33.0
Name: testers, dtype: float64
0.25    2.5
0.50    2.5
0.75    2.5
Name: offers, dtype: float64
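
The same statistics can likely be gathered in a single call. The following is a minimal sketch using `DataFrame.describe()` on a toy DataFrame built from the first five rows printed earlier; it is not the full dataset, so the numbers differ from the output above.

```python
import pandas as pd

# First five rows shown earlier; 2.5 stands in for suppressed "0-5" values.
toy = pd.DataFrame({
    "hs_bound": [44.0, 56.0, 112.0, 49.0, 70.0],
    "testers": [2.5, 9.0, 79.0, 2.5, 10.0],
    "offers": [2.5, 2.5, 29.0, 2.5, 2.5],
})

# describe() reports count, mean, std, min, quartiles, and max per column.
summary = toy.describe()

# Keep only the mean and the three quartiles (50% is the median).
print(summary.loc[["mean", "25%", "50%", "75%"]])
```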


### Data Wrangling

In [132]:
# Create new columns based on existing columns or calculations.
data["pct_testers"] = data["testers"] / data["hs_bound"]
data["pct_offers"] = data["offers"] / data["hs_bound"]

print(data.head(5))

# Drop column(s) from your dataset.
data = data.drop(["DBN"], axis=1)

# Drop a row(s) from your dataset.
idx_max = data['hs_bound'].idxmax()
data = data.drop([idx_max])

print(data.shape)

      DBN                                          school  hs_bound  testers  \
0  01M034         P.S. 034 FRANKLIN D. ROOSEVELT (01M034)      44.0      2.5   
1  01M140                 P.S. 140 NATHAN STRAUS (01M140)      56.0      9.0   
2  01M184                   P.S. 184M SHUANG WEN (01M184)     112.0     79.0   
3  01M188             P.S. 188 THE ISLAND SCHOOL (01M188)      49.0      2.5   
4  01M332  UNIVERSITY NEIGHBORHOOD MIDDLE SCHOOL (01M332)      70.0     10.0   

   offers  pct_testers  pct_offers  
0     2.5     0.056818    0.056818  
1     2.5     0.160714    0.044643  
2    29.0     0.705357    0.258929  
3     2.5     0.051020    0.051020  
4     2.5     0.142857    0.035714  
(657, 6)


In [None]:
# Sort your data based on multiple variables.
data = data.sort_values(by=['pct_offers', 'school'], ascending=[False, True])

print(data.head(5))

# Filter your data based on some condition.
ignore_small_vals = data[data["testers"] != 2.5]

print(ignore_small_vals.head(5))

# Convert all the string values to upper or lower cases in one column.
data_lower = data["school"].str.lower()

print(data_lower.head(5))


In [None]:

# Group your dataset by one column, and get the mean, min, and max values by group.

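# note: if each school name is unique, mean, min, and max coincide within each group
grouped_data = data.groupby("school").agg(
  {
    "hs_bound": ["mean", "min", "max"],
    "testers": ["mean", "min", "max"],
    "offers": ["mean", "min", "max"]
  }
)

print(grouped_data.head(5))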


# Groupby()
# agg() or .apply()
# Group your dataset by two columns and then sort the aggregated results within the groups.
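
For the two-column grouping described in the last comment, one possible approach is sketched below on toy data. The `borough` and `size` columns are hypothetical and do not exist in the dataset as loaded; in the real data a borough code could perhaps be derived from the DBN, assuming its third character is the borough letter.

```python
import pandas as pd

# Toy rows for illustration only; "borough" and "size" are hypothetical columns.
toy = pd.DataFrame({
    "borough": ["M", "M", "M", "K", "K", "K"],
    "size": ["small", "large", "large", "small", "small", "large"],
    "offers": [2.5, 29.0, 11.0, 2.5, 6.0, 40.0],
})

# Aggregate offers for each (borough, size) pair.
grouped = (
    toy.groupby(["borough", "size"])["offers"]
    .agg(["mean", "min", "max"])
    .reset_index()
)

# Keep boroughs together, ordering the size groups inside each by mean offers.
grouped = grouped.sort_values(["borough", "mean"], ascending=[True, False])

print(grouped)
```

Sorting on the first grouping column and then on the aggregated value keeps each borough's rows adjacent while ranking the subgroups within it.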


### Conclusions
After exploring your dataset, provide a short summary of what you noticed from this dataset. What would you explore further with more time?