# SCENARIO 

In SY 20-21, TEA required all Texas LEAs (Local Education Agencies, such as school districts) to administer either the Tx-KEA or the mCLASS assessment to students to measure grade-level readiness. The Tx-KEA and mCLASS diagnostics use different scoring scales and evaluate different sets of skills. Initial analysis of Fall 2020 data indicates a 38 percentage point gap in grade-level readiness between the two tests: 76% of students who took Tx-KEA were found to be grade-level ready, compared to just 38% of students who took mCLASS.

The task is to extract, analyze, and present data that will help educators understand how much of the difference in measured readiness is due to differences in the underlying populations of students taking each test, as opposed to differences in test design and scoring.
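
One common way to separate a population effect from a test effect is direct standardization: compute one test's readiness rate within each demographic subgroup, then reweight those rates by the other test's subgroup mix. The sketch below illustrates the idea on made-up toy data. The `ready` column and the `standardized_rate` helper are hypothetical names, and this is only one possible approach, not necessarily the method used later in the analysis.

```python
import pandas as pd


def standardized_rate(df, reference, group_cols, ready_col="ready"):
    """Readiness rate of `df` reweighted to the subgroup mix of `reference`."""
    # Readiness rate within each demographic subgroup of df.
    rates = df.groupby(group_cols)[ready_col].mean()
    # Share of the reference population that falls in each subgroup.
    weights = reference.groupby(group_cols).size() / len(reference)
    # Keep only subgroups present in both, then renormalize the weights.
    rates, weights = rates.align(weights, join="inner")
    return float((rates * weights).sum() / weights.sum())


# Toy data, made up for illustration (not the TEA results).
mclass_toy = pd.DataFrame({
    "eco": ["YES", "YES", "YES", "NO"],
    "ready": [False, False, True, True],
})
txkea_toy = pd.DataFrame({
    "eco": ["YES", "NO", "NO", "NO"],
    "ready": [False, True, True, True],
})

raw_mclass = mclass_toy["ready"].mean()
adj_mclass = standardized_rate(mclass_toy, txkea_toy, ["eco"])
```

If the adjusted rate moves most of the way toward the other test's raw rate, population mix explains much of the gap; if it barely moves, test design and scoring are the more likely drivers.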

Analysis will be limited to students taking the English version of each diagnostic to avoid complications that arise from differences in the English and Spanish versions of mCLASS.


Database Tables:

1) MCLASS – student level performance on mCLASS diagnostic assessment.
• Field ‘assessment_edition’ indicates whether the student took the English version of the test (DIBELS) or the Spanish version (IDEL).
• For field ‘composite_level’, values of ‘At Benchmark’ or ‘Above Benchmark’ indicate grade level readiness.

2) TXKEA – student level performance on Tx-KEA diagnostic assessment.
• Field ‘language’ indicates whether the student took the English or Spanish version.
• For field ‘lit_screening_benchmark’, a value of ‘On-Track’ indicates grade level readiness.

3) DEMO – student level demographic data.
• Field ‘eco’ indicates whether the student is identified as ‘economically disadvantaged’.
• Field ‘spec_ed’ indicates whether the student receives special education services.
• Field ‘el’ indicates whether the student is identified as an English learner.
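
The readiness rules listed above can be turned into a single boolean column so that both assessments are compared on the same footing. A minimal sketch on toy rows follows. The `ready` column name and the `add_ready_flag` helper are assumptions for illustration, and the value `Below Benchmark` is only a placeholder for any non-ready level.

```python
import pandas as pd

# Benchmark values that count as grade-level ready, per the table notes.
MCLASS_READY = {"At Benchmark", "Above Benchmark"}
TXKEA_READY = {"On-Track"}


def add_ready_flag(df, benchmark_col, ready_values):
    """Return a copy of df with a boolean `ready` column (hypothetical name)."""
    out = df.copy()
    out["ready"] = out[benchmark_col].isin(ready_values)
    return out


# Toy rows for illustration only.
mclass_toy = pd.DataFrame(
    {"composite_level": ["At Benchmark", "Below Benchmark", "Above Benchmark"]}
)
txkea_toy = pd.DataFrame(
    {"lit_screening_benchmark": ["On-Track", "Monitor", "Support"]}
)

mclass_flagged = add_ready_flag(mclass_toy, "composite_level", MCLASS_READY)
txkea_flagged = add_ready_flag(txkea_toy, "lit_screening_benchmark", TXKEA_READY)
```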



In [71]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


First, I will look at the shape of each data file to decide which join to perform.

In [72]:
demo = pd.read_csv('Data/DEMO.csv')
demo.head()



Unnamed: 0,student_id,district_id,ethnicity,eco,el,spec_ed
0,97840593,798403,Black or African American,YES,NO,NO
1,885938600,53405,White,YES,NO,NO
2,871944576,798403,Black or African American,YES,NO,NO
3,818725252,53405,White,NO,NO,NO
4,702015143,800409,White,YES,NO,YES


In [73]:
demo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 174348 entries, 0 to 174347
Data columns (total 6 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   student_id   174348 non-null  object
 1   district_id  174348 non-null  int64 
 2   ethnicity    174348 non-null  object
 3   eco          174348 non-null  object
 4   el           174348 non-null  object
 5   spec_ed      174348 non-null  object
dtypes: int64(1), object(5)
memory usage: 8.0+ MB


In [74]:
txkea = pd.read_csv('Data/TXKEA.csv')
txkea.head()

Unnamed: 0,district_id,student_id,language,lit_screening_benchmark,lit_screening_score,date
0,70408,8878861576,English,Monitor,15,9/15/2020
1,808486,8878825752,English,On-Track,29,11/30/2020
2,801404,8878799239,English,Support,14,10/12/2020
3,808488,8878794629,English,On-Track,35,11/6/2020
4,808486,8878745384,English,On-Track,23,10/16/2020


In [75]:
txkea.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112048 entries, 0 to 112047
Data columns (total 6 columns):
 #   Column                   Non-Null Count   Dtype 
---  ------                   --------------   ----- 
 0   district_id              112048 non-null  int64 
 1   student_id               112048 non-null  int64 
 2   language                 112048 non-null  object
 3   lit_screening_benchmark  112048 non-null  object
 4   lit_screening_score      112048 non-null  int64 
 5   date                     112048 non-null  object
dtypes: int64(3), object(3)
memory usage: 5.1+ MB


A cursory look shows that the DEMO dataframe has the most rows and no null values. I will attach its demographic columns to each assessment table with a left join on student_id, so that every assessment row is kept.
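
The `info()` output above reports student_id as `object` in DEMO but `int64` in TXKEA, and a key stored with different types on each side may stop some IDs from matching. The sketch below shows one way to guard the join: normalize both keys to strings, keep one demographic record per student, and use `indicator=True` to count unmatched rows. The `merge_demographics` helper is hypothetical, and dropping duplicate DEMO rows is an assumption that should be checked against the real data.

```python
import pandas as pd


def merge_demographics(scores, demo, key="student_id"):
    """Left-join demographics onto assessment rows with a normalized key."""
    scores = scores.copy()
    demo = demo.copy()
    # Compare IDs as stripped strings so 123 and "123 " match.
    scores[key] = scores[key].astype(str).str.strip()
    demo[key] = demo[key].astype(str).str.strip()
    # Assumption: one demographic record per student is wanted.
    demo = demo.drop_duplicates(subset=key)
    return scores.merge(
        demo, on=key, how="left", validate="many_to_one", indicator=True
    )


# Toy frames for illustration only.
scores_toy = pd.DataFrame({"student_id": [1, 2, 3], "score": [15, 29, 14]})
demo_toy = pd.DataFrame(
    {"student_id": ["1", "2 ", "2 ", 9], "eco": ["YES", "NO", "NO", "NO"]}
)

merged = merge_demographics(scores_toy, demo_toy)
match_counts = merged["_merge"].value_counts()
```

Here `match_counts` reports how many assessment rows found a demographic record (`both`) and how many did not (`left_only`), which makes the size of any later `dropna` explicit.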

In [76]:
txkea = txkea.merge(demo, on='student_id', how='left')

In [77]:
txkea.head()

Unnamed: 0,district_id_x,student_id,language,lit_screening_benchmark,lit_screening_score,date,district_id_y,ethnicity,eco,el,spec_ed
0,70408,8878861576,English,Monitor,15,9/15/2020,70408.0,Hispanic/Latino,NO,NO,NO
1,808486,8878825752,English,On-Track,29,11/30/2020,808486.0,White,NO,NO,NO
2,801404,8878799239,English,Support,14,10/12/2020,801404.0,Hispanic/Latino,NO,NO,NO
3,808488,8878794629,English,On-Track,35,11/6/2020,808488.0,Hispanic/Latino,YES,NO,NO
4,808486,8878745384,English,On-Track,23,10/16/2020,808486.0,Hispanic/Latino,YES,NO,NO


In [78]:
# Drop the copy of district_id that came from DEMO
txkea.drop('district_id_y', axis=1, inplace=True)
txkea.head()

Unnamed: 0,district_id_x,student_id,language,lit_screening_benchmark,lit_screening_score,date,ethnicity,eco,el,spec_ed
0,70408,8878861576,English,Monitor,15,9/15/2020,Hispanic/Latino,NO,NO,NO
1,808486,8878825752,English,On-Track,29,11/30/2020,White,NO,NO,NO
2,801404,8878799239,English,Support,14,10/12/2020,Hispanic/Latino,NO,NO,NO
3,808488,8878794629,English,On-Track,35,11/6/2020,Hispanic/Latino,YES,NO,NO
4,808486,8878745384,English,On-Track,23,10/16/2020,Hispanic/Latino,YES,NO,NO


# Component 1 - Drop Spanish language versions of the test

We must limit the analysis to students taking the English version of each diagnostic to avoid complications that arise from differences in the English and Spanish versions of mCLASS.
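
The same restriction applies to both assessment tables, although each marks the English version differently: ‘language’ equals ‘English’ in TXKEA, while ‘assessment_edition’ equals ‘DIBELS’ in MCLASS. A minimal sketch of a shared filter, on toy rows, is shown here; `english_only` is a hypothetical helper name.

```python
import pandas as pd


def english_only(df, column, english_value):
    """Keep rows whose language column equals english_value, as a new frame."""
    return df.loc[df[column] == english_value].copy()


# Toy rows for illustration only.
mclass_toy = pd.DataFrame({"assessment_edition": ["DIBELS", "IDEL", "DIBELS"]})
txkea_toy = pd.DataFrame({"language": ["English", "Spanish"]})

mclass_en = english_only(mclass_toy, "assessment_edition", "DIBELS")
txkea_en = english_only(txkea_toy, "language", "English")
```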

In [79]:
# Check which language values are present
txkea.language.value_counts()

English    94558
Spanish    17515
Name: language, dtype: int64

In [80]:
# Keep English rows only; .copy() makes the result an independent frame
txkea = txkea[txkea['language'] == 'English'].copy()

In [81]:
txkea.language.value_counts()

English    94558
Name: language, dtype: int64

In [82]:
txkea.drop('language',axis=1,inplace=True)


In [83]:
txkea.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 94558 entries, 0 to 112072
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   district_id_x            94558 non-null  int64 
 1   student_id               94558 non-null  object
 2   lit_screening_benchmark  94558 non-null  object
 3   lit_screening_score      94558 non-null  int64 
 4   date                     94558 non-null  object
 5   ethnicity                71030 non-null  object
 6   eco                      71030 non-null  object
 7   el                       71030 non-null  object
 8   spec_ed                  71030 non-null  object
dtypes: int64(2), object(7)
memory usage: 7.2+ MB


In [84]:
txkea.head()

Unnamed: 0,district_id_x,student_id,lit_screening_benchmark,lit_screening_score,date,ethnicity,eco,el,spec_ed
0,70408,8878861576,Monitor,15,9/15/2020,Hispanic/Latino,NO,NO,NO
1,808486,8878825752,On-Track,29,11/30/2020,White,NO,NO,NO
2,801404,8878799239,Support,14,10/12/2020,Hispanic/Latino,NO,NO,NO
3,808488,8878794629,On-Track,35,11/6/2020,Hispanic/Latino,YES,NO,NO
4,808486,8878745384,On-Track,23,10/16/2020,Hispanic/Latino,YES,NO,NO


# Ensure that no rows have values for both MCLASS and TXKEA assessments
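
One way to perform this check is to intersect the student IDs of the two assessment tables. A minimal sketch on toy frames follows. `students_in_both` is a hypothetical helper, and real use would pass the loaded MCLASS and TXKEA frames instead of the toy data.

```python
import pandas as pd


def students_in_both(mclass, txkea, key="student_id"):
    """Return the set of IDs that appear in both assessment tables."""
    # Compare IDs as strings so differing dtypes do not hide an overlap.
    mclass_ids = set(mclass[key].astype(str))
    txkea_ids = set(txkea[key].astype(str))
    return mclass_ids & txkea_ids


# Toy frames for illustration only.
mclass_toy = pd.DataFrame({"student_id": [101, 102, 103]})
txkea_toy = pd.DataFrame({"student_id": [103, 104]})

overlap = students_in_both(mclass_toy, txkea_toy)
```

An empty set means no student has results for both assessments; any IDs returned would need a rule for which record to keep.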

# Assessment Scoring

Let's take a quick look at each assessment's outcomes.

In [85]:
txkea.lit_screening_benchmark.value_counts()

On-Track    71152
Support     14468
Monitor      8938
Name: lit_screening_benchmark, dtype: int64

# Number Of Missing Values By Column

The left join produces NaN values wherever a student has no match in DEMO. Let's count the missing values in each column and sort them.

In [86]:

missing = pd.concat([txkea.isnull().sum(), 100 * txkea.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.sort_values(by='count')

Unnamed: 0,count,%
district_id_x,0,0.0
student_id,0,0.0
lit_screening_benchmark,0,0.0
lit_screening_score,0,0.0
date,0,0.0
ethnicity,23528,24.882083
eco,23528,24.882083
el,23528,24.882083
spec_ed,23528,24.882083
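
Before discarding rows that lack demographic data, it may be worth checking whether those students differ in readiness from the ones that matched, since a large difference would bias the remaining sample. The sketch below compares the On-Track rate for the two groups on toy rows. `readiness_by_match` and its group labels are hypothetical names used only for illustration.

```python
import pandas as pd


def readiness_by_match(df, demo_col="ethnicity",
                       benchmark_col="lit_screening_benchmark"):
    """On-Track rate for rows with and without demographic data."""
    # Label each row by whether the demographic join found a match.
    group = df[demo_col].notna().map({True: "matched", False: "unmatched"})
    on_track = df[benchmark_col].eq("On-Track")
    return on_track.groupby(group).mean()


# Toy rows for illustration only.
toy = pd.DataFrame({
    "lit_screening_benchmark": ["On-Track", "Support", "On-Track", "On-Track"],
    "ethnicity": ["White", None, None, "Hispanic/Latino"],
})

rates = readiness_by_match(toy)
```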


In [87]:
# Drop rows with no demographic match (about 24.9% of English Tx-KEA rows)
txkea.dropna(subset=['ethnicity'], inplace=True)

In [88]:
missing = pd.concat([txkea.isnull().sum(), 100 * txkea.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.sort_values(by='count')

Unnamed: 0,count,%
district_id_x,0,0.0
student_id,0,0.0
lit_screening_benchmark,0,0.0
lit_screening_score,0,0.0
date,0,0.0
ethnicity,0,0.0
eco,0,0.0
el,0,0.0
spec_ed,0,0.0


In [89]:
txkea.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 71030 entries, 0 to 98027
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   district_id_x            71030 non-null  int64 
 1   student_id               71030 non-null  object
 2   lit_screening_benchmark  71030 non-null  object
 3   lit_screening_score      71030 non-null  int64 
 4   date                     71030 non-null  object
 5   ethnicity                71030 non-null  object
 6   eco                      71030 non-null  object
 7   el                       71030 non-null  object
 8   spec_ed                  71030 non-null  object
dtypes: int64(2), object(7)
memory usage: 5.4+ MB


In [90]:
# index=False avoids writing the leftover row index as an extra column
txkea.to_csv('txkea_clean.csv', index=False)