# SCENARIO 

In SY 20-21, TEA required all Texas LEAs (Local Education Agencies, such as school districts) to administer either the Tx-KEA or the mCLASS assessment to students in order to measure grade-level readiness. The Tx-KEA and mCLASS diagnostics use different scoring scales and evaluate different sets of skills. Initial analysis of Fall 2020 data indicates a 38 percentage point gap in grade-level readiness between the two tests: 76% of students who took Tx-KEA were found to be grade-level ready, compared to just 38% of students who took mCLASS.

The task is to extract, analyze, and present data that will help educators understand to what extent the difference in readiness measured by the two tests is due to differences in the underlying populations of students taking each one (as opposed to differences in test design and scoring).

Analysis will be limited to students taking the English version of each diagnostic to avoid complications that arise from differences in the English and Spanish versions of mCLASS.
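
One way to frame the question is direct standardization: compute each test's readiness rate within a demographic subgroup, then ask what one test's overall rate would be if its students had the other test's subgroup mix. The sketch below is only an illustration of that idea. The numbers are made up (they are not from the real files), the test labels `A` and `B` are placeholders, and this is not necessarily the method used later in the analysis.

```python
import pandas as pd

# Hypothetical subgroup summaries (made-up numbers, NOT from the real data).
summary = pd.DataFrame(
    {
        "test": ["A", "A", "B", "B"],
        "eco": ["NO", "YES", "NO", "YES"],
        "n_students": [60, 40, 30, 70],
        "n_ready": [48, 24, 15, 21],
    }
)


def overall_rate(df):
    """Share of students flagged as ready across all subgroups."""
    return df["n_ready"].sum() / df["n_students"].sum()


a = summary[summary["test"] == "A"].set_index("eco")
b = summary[summary["test"] == "B"].set_index("eco")

# Gap as observed, with each test's own population mix
raw_gap = overall_rate(a) - overall_rate(b)

# Test B's subgroup rates, reweighted by test A's subgroup shares
b_rates = b["n_ready"] / b["n_students"]
a_shares = a["n_students"] / a["n_students"].sum()
b_standardized = (b_rates * a_shares).sum()

# Gap that remains once both tests share the same population mix
adjusted_gap = overall_rate(a) - b_standardized

print(f"raw gap: {raw_gap:.2f}, adjusted gap: {adjusted_gap:.2f}")
```

In this toy case the gap shrinks once the population mix is held fixed, and the part that disappears is the portion attributable to composition rather than to the tests themselves.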


Database Tables:

1) MCLASS – student level performance on mCLASS diagnostic assessment.
• Field ‘assessment_edition’ indicates whether the student took the English version of the test (DIBELS) or the Spanish version (IDEL).
• For field ‘composite_level’, values of ‘At Benchmark’ or ‘Above Benchmark’ indicate grade level readiness.

2) TXKEA – student level performance on Tx-KEA diagnostic assessment.
• Field ‘language’ indicates whether the student took the English or Spanish version.
• For field ‘lit_screening_benchmark’, a value of ‘On-Track’ indicates grade level readiness.

3) DEMO – student level demographic data.
• Field ‘eco’ indicates whether the student is identified as ‘economically disadvantaged’.
• Field ‘spec_ed’ indicates whether the student receives special education services.
• Field ‘el’ indicates whether the student is identified as an English learner.
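
The readiness rules above can be turned into a boolean flag on each table. This is a minimal sketch on toy rows: the frame names `mclass_demo` and `txkea_demo`, the `ready` column, and the non-ready Tx-KEA value `OTHER (placeholder)` are assumptions made for illustration, since the field descriptions only name the values that count as ready.

```python
import pandas as pd

# Toy rows for illustration only (not taken from the real files).
mclass_demo = pd.DataFrame(
    {
        "student_id": [1, 2, 3, 4],
        "composite_level": [
            "At Benchmark",
            "Above Benchmark",
            "Below Benchmark",
            "Well Below Benchmark",
        ],
    }
)
txkea_demo = pd.DataFrame(
    {
        "student_id": [5, 6, 7],
        # "OTHER (placeholder)" stands in for any value other than 'On-Track'
        "lit_screening_benchmark": ["On-Track", "OTHER (placeholder)", "On-Track"],
    }
)

# mCLASS: ready when the composite level is At or Above Benchmark
mclass_demo["ready"] = mclass_demo["composite_level"].isin(
    ["At Benchmark", "Above Benchmark"]
)

# Tx-KEA: ready when the literacy screening benchmark is On-Track
txkea_demo["ready"] = txkea_demo["lit_screening_benchmark"].eq("On-Track")

# The mean of a boolean column is the share of ready students
mclass_rate = mclass_demo["ready"].mean()
txkea_rate = txkea_demo["ready"].mean()
```

A shared `ready` column makes the two assessments comparable even though their raw scales differ.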



In [166]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


First, I want to look at the shape of each data file so I can decide which join to perform.

In [167]:
demo = pd.read_csv('Data/DEMO.csv')
demo.head()



Unnamed: 0,student_id,district_id,ethnicity,eco,el,spec_ed
0,97840593,798403,Black or African American,YES,NO,NO
1,885938600,53405,White,YES,NO,NO
2,871944576,798403,Black or African American,YES,NO,NO
3,818725252,53405,White,NO,NO,NO
4,702015143,800409,White,YES,NO,YES


In [168]:
demo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 174348 entries, 0 to 174347
Data columns (total 6 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   student_id   174348 non-null  object
 1   district_id  174348 non-null  int64 
 2   ethnicity    174348 non-null  object
 3   eco          174348 non-null  object
 4   el           174348 non-null  object
 5   spec_ed      174348 non-null  object
dtypes: int64(1), object(5)
memory usage: 8.0+ MB


In [169]:
mclass = pd.read_csv('Data/MCLASS.csv')
mclass.head()

Unnamed: 0,student_id,district_id,school_id,assessment_edition,composite_level,composite_score
0,8878547139,806405.0,806405802.0,DIBELS 8th Edition,At Benchmark,306.0
1,8878132753,818408.0,818408807.0,DIBELS 8th Edition,Below Benchmark,291.0
2,8877357966,,,DIBELS 8th Edition,At Benchmark,314.0
3,8877359986,820405.0,820405805.0,DIBELS 8th Edition,At Benchmark,326.0
4,8877961413,820407.0,820407871.0,DIBELS 8th Edition,At Benchmark,308.0


In [170]:
mclass.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63268 entries, 0 to 63267
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   student_id          63268 non-null  int64  
 1   district_id         57028 non-null  float64
 2   school_id           56676 non-null  object 
 3   assessment_edition  63268 non-null  object 
 4   composite_level     63268 non-null  object 
 5   composite_score     56124 non-null  float64
dtypes: float64(2), int64(1), object(3)
memory usage: 2.9+ MB


After a cursory look at the data, the DEMO dataframe has the most rows and no null values, so it is a reliable source of demographic fields. I will left-join it onto each assessment table on the student_id column, which keeps every assessment record.

In [171]:
mclass = mclass.merge(demo, on='student_id', how='left')

In [172]:
mclass.head()

Unnamed: 0,student_id,district_id_x,school_id,assessment_edition,composite_level,composite_score,district_id_y,ethnicity,eco,el,spec_ed
0,8878547139,806405.0,806405802.0,DIBELS 8th Edition,At Benchmark,306.0,806405.0,White,NO,NO,NO
1,8878132753,818408.0,818408807.0,DIBELS 8th Edition,Below Benchmark,291.0,818408.0,White,YES,NO,NO
2,8877357966,,,DIBELS 8th Edition,At Benchmark,314.0,95408.0,White,NO,NO,NO
3,8877359986,820405.0,820405805.0,DIBELS 8th Edition,At Benchmark,326.0,820405.0,White,NO,NO,NO
4,8877961413,820407.0,820407871.0,DIBELS 8th Edition,At Benchmark,308.0,820407.0,White,YES,NO,NO
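
Note that `student_id` is `int64` in MCLASS but `object` in DEMO according to the `info()` output above. Mismatched key types may be one reason some rows fail to match, although that cannot be confirmed from the output alone. A minimal sketch of a more defensive merge follows, using toy frames with hypothetical IDs; `left`, `right`, and `match_counts` are illustrative names.

```python
import pandas as pd

# Toy frames (hypothetical IDs): one key column is int64, the other holds strings
left = pd.DataFrame(
    {
        "student_id": [101, 102, 103],
        "composite_level": ["At Benchmark", "Below Benchmark", "At Benchmark"],
    }
)
right = pd.DataFrame(
    {
        "student_id": ["101", "102", "999"],
        "eco": ["YES", "NO", "NO"],
    }
)

# Normalize both keys to stripped strings so equal IDs compare equal
left["student_id"] = left["student_id"].astype(str).str.strip()
right["student_id"] = right["student_id"].astype(str).str.strip()

merged = left.merge(
    right,
    on="student_id",
    how="left",
    validate="many_to_one",  # raises if a student_id repeats in the right frame
    indicator=True,          # adds a _merge column: both / left_only / right_only
)

# How many assessment rows found a demographic record, and how many did not
match_counts = merged["_merge"].value_counts()
print(match_counts)
```

The `_merge` column shows how many assessment rows found a demographic record, which is the same information the NaN counts convey, but stated explicitly.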


# Number Of Missing Values By Column

There are many NaN values. Let's count the number of missing values in each column and sort them.

In [173]:

missing = pd.concat([mclass.isnull().sum(), 100 * mclass.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.sort_values(by='count')

Unnamed: 0,count,%
student_id,0,0.0
assessment_edition,0,0.0
composite_level,0,0.0
district_id_x,6241,9.861113
school_id,6593,10.417292
composite_score,7145,11.289482
district_id_y,15689,24.789458
ethnicity,15689,24.789458
eco,15689,24.789458
el,15689,24.789458
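
Since the same missing-value summary is rebuilt several times in this notebook, it could be wrapped in a small helper. The function name `missing_summary` and the toy frame are assumptions for illustration; the logic mirrors the cell above.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return per-column missing counts and percentages, sorted by count."""
    summary = pd.DataFrame(
        {"count": df.isnull().sum(), "%": 100 * df.isnull().mean()}
    )
    return summary.sort_values(by="count")


# Toy frame for illustration only
toy = pd.DataFrame(
    {
        "a": [1, np.nan, 3, 4],
        "b": [np.nan, np.nan, "x", "y"],
        "c": [1, 2, 3, 4],
    }
)
result = missing_summary(toy)
print(result)
```

Calling `missing_summary(mclass)` after each cleaning step would then replace the three repeated lines.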


In [174]:
# DEMO has no nulls, so a null ethnicity means the left join found no DEMO record; drop those rows
mclass.dropna(subset=['ethnicity'],inplace=True)

In [175]:
missing = pd.concat([mclass.isnull().sum(), 100 * mclass.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.sort_values(by='count')

Unnamed: 0,count,%
student_id,0,0.0
assessment_edition,0,0.0
composite_level,0,0.0
district_id_y,0,0.0
ethnicity,0,0.0
eco,0,0.0
el,0,0.0
spec_ed,0,0.0
district_id_x,4676,9.823529
school_id,4942,10.382353


# Component 1 - Drop Spanish language versions of the test

We must limit the analysis to students taking the English version of each diagnostic to avoid complications that arise from differences in the English and Spanish versions of mCLASS.

In [176]:
#checking values
mclass.assessment_edition.value_counts()

DIBELS 8th Edition        42182
IDEL Standard 3 Period     5418
Name: assessment_edition, dtype: int64

In [177]:
# dropping Spanish versions
mclass.drop(mclass.loc[mclass['assessment_edition']=='IDEL Standard 3 Period'].index,inplace=True)

In [178]:
mclass.assessment_edition.value_counts()

DIBELS 8th Edition    42182
Name: assessment_edition, dtype: int64

In [179]:
# only one edition remains, so the column no longer carries information
mclass.drop('assessment_edition', axis=1,inplace=True)


In [180]:
mclass.head()

Unnamed: 0,student_id,district_id_x,school_id,composite_level,composite_score,district_id_y,ethnicity,eco,el,spec_ed
0,8878547139,806405.0,806405802.0,At Benchmark,306.0,806405.0,White,NO,NO,NO
1,8878132753,818408.0,818408807.0,Below Benchmark,291.0,818408.0,White,YES,NO,NO
2,8877357966,,,At Benchmark,314.0,95408.0,White,NO,NO,NO
3,8877359986,820405.0,820405805.0,At Benchmark,326.0,820405.0,White,NO,NO,NO
4,8877961413,820407.0,820407871.0,At Benchmark,308.0,820407.0,White,YES,NO,NO


In [181]:
mclass.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 42182 entries, 0 to 63236
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   student_id       42182 non-null  object 
 1   district_id_x    38222 non-null  float64
 2   school_id        37956 non-null  object 
 3   composite_level  42182 non-null  object 
 4   composite_score  42182 non-null  float64
 5   district_id_y    42182 non-null  float64
 6   ethnicity        42182 non-null  object 
 7   eco              42182 non-null  object 
 8   el               42182 non-null  object 
 9   spec_ed          42182 non-null  object 
dtypes: float64(3), object(7)
memory usage: 3.5+ MB


In [182]:
missing = pd.concat([mclass.isnull().sum(), 100 * mclass.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.sort_values(by='count')

Unnamed: 0,count,%
student_id,0,0.0
composite_level,0,0.0
composite_score,0,0.0
district_id_y,0,0.0
ethnicity,0,0.0
eco,0,0.0
el,0,0.0
spec_ed,0,0.0
district_id_x,3960,9.387891
school_id,4226,10.018491


In [183]:
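# district_id_x (from MCLASS) has missing values; district_id_y (from DEMO) is complete, so keep that one
mclass.drop('district_id_x',axis=1,inplace=True)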
mclass.head()

Unnamed: 0,student_id,school_id,composite_level,composite_score,district_id_y,ethnicity,eco,el,spec_ed
0,8878547139,806405802.0,At Benchmark,306.0,806405.0,White,NO,NO,NO
1,8878132753,818408807.0,Below Benchmark,291.0,818408.0,White,YES,NO,NO
2,8877357966,,At Benchmark,314.0,95408.0,White,NO,NO,NO
3,8877359986,820405805.0,At Benchmark,326.0,820405.0,White,NO,NO,NO
4,8877961413,820407871.0,At Benchmark,308.0,820407.0,White,YES,NO,NO


In [184]:

missing = pd.concat([mclass.isnull().sum(), 100 * mclass.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.sort_values(by='count')

Unnamed: 0,count,%
student_id,0,0.0
composite_level,0,0.0
composite_score,0,0.0
district_id_y,0,0.0
ethnicity,0,0.0
eco,0,0.0
el,0,0.0
spec_ed,0,0.0
school_id,4226,10.018491


# Assessment Scoring

Let's take a quick look at the distribution of outcomes on the mCLASS assessment.

In [185]:
both_mclass_totals = mclass.composite_level.value_counts()
print(both_mclass_totals)


Well Below Benchmark    18913
Above Benchmark          8620
Below Benchmark          8100
At Benchmark             6549
Name: composite_level, dtype: int64
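
The counts above can be converted to a readiness rate using the rule that 'At Benchmark' and 'Above Benchmark' indicate grade level readiness. This sketch re-enters the printed counts by hand; `counts`, `ready_levels`, and `ready_rate` are illustrative names.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series(
    {
        "Well Below Benchmark": 18913,
        "Above Benchmark": 8620,
        "Below Benchmark": 8100,
        "At Benchmark": 6549,
    },
    name="composite_level",
)

# Levels that count as grade level ready
ready_levels = ["At Benchmark", "Above Benchmark"]
ready_rate = counts[ready_levels].sum() / counts.sum()

print(f"{ready_rate:.1%}")  # 36.0%
```

On this cleaned, English-only subset with matched demographic records, the rate comes out to about 36%, close to but not identical to the 38% quoted in the scenario, which was presumably computed on a different set of students.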


In [188]:
mclass.to_csv('mclass_clean.csv')