**Hypothesis: Do STEM subjects award a statistically significantly different proportion of A grades than Humanities subjects?**

In [38]:
import pandas as pd
import numpy as np
from scipy import stats

# pd.read_csv already returns a DataFrame, so no further conversion is needed
subjects = pd.read_csv('subjects.csv')
subjects_code = pd.read_csv('subject_memberships.csv')
grade_distributions = pd.read_csv('grade_distributions.csv')

In [39]:
subjects = subjects.rename(columns={"code": "subject_code"})
subjects_code['subject_code'] = subjects_code['subject_code'].astype(str)

subjects.set_index('subject_code', inplace=True)
subjects_code.set_index('subject_code', inplace=True)

subjects1 = subjects.join(subjects_code, how='outer', on='subject_code')


In [40]:
grade_distributions.set_index('course_offering_uuid', inplace=True)
subjects1.set_index('course_offering_uuid', inplace=True)

- I cleaned the data by joining the separate datasets and dropping rows with null values, which made up less than 1% of the data.
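A quick way to confirm how much data a `dropna()` call would discard is to measure the share of rows containing any null before dropping them. The sketch below uses a small hypothetical frame (`toy`, with made-up values) rather than the real joined data, so the printed share is illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the joined data (values are made up)
toy = pd.DataFrame({
    "name": ["History", "Chemistry", None, "Statistics"],
    "a_count": [10.0, 12.0, 3.0, np.nan],
})

# Share of rows that contain at least one null value
null_row_share = toy.isna().any(axis=1).mean()
print(f"Rows with nulls: {null_row_share:.1%}")

# Drop those rows once the share is known to be acceptably small
cleaned = toy.dropna()
```

Running the same two lines on the real `df` just before `df.dropna()` would show the share of rows lost in the cleaning step.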

In [41]:
df = subjects1.join(grade_distributions, how='outer')

In [42]:
df = df.reset_index()

- I manually created two lists for STEM and Humanities subjects.

In [43]:
STEM_subjects = ['chemical and biological engineering', 'engineering physics', 'biology', 'biochemistry', 
                 'mechanical engineering', 'biomolecular chemistry', 'biomedical engineering', 'chemistry', 
                 'medical physics', 'statistics', 'medicine', 'engineering mechanics and astronautics', 
                 'computer sciences']
Humanities = ['art history', 'history of science', 'history', 'geography', 'english as a second language',
              'african languages and literature', 'asian languages and cultures', 'asian languages and cultures: languages',
             'comparative literature', 'literature in translation', 'law', 'political science', 'philosophy']

In [44]:
df['No_Student'] = df.iloc[:, 4:-1].sum(axis=1)

In [45]:
df = df.dropna()
df = df.iloc[:,[1,4,-1]]

In [46]:
df = df.groupby(['name']).sum().reset_index()

In [47]:
df['STEM'] = df.name.apply(lambda x: 1 if x.lower() in STEM_subjects else 0)
df['Humanities'] = df.name.apply(lambda x: 1 if x.lower() in Humanities else 0)
df['to_drop'] = df['STEM']+df['Humanities']

- I dropped all subjects that were neither STEM nor Humanities subjects.

In [48]:
df = df[df.to_drop == 1]
df = df.iloc[:,:-1]
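The flag-and-filter step above could also be expressed with a single label column, which avoids the helper `to_drop` column. This is a minimal sketch on hypothetical data: `stem_names`, `humanities_names`, `label_group`, and `toy` are illustrative names, and the sets are shortened stand-ins for the full lists defined earlier.

```python
import pandas as pd

# Hypothetical, shortened stand-ins for the full subject lists
stem_names = {"chemistry", "statistics"}
humanities_names = {"history", "philosophy"}

def label_group(name: str) -> str:
    """Return 'STEM', 'Humanities', or 'Other' for a subject name."""
    key = name.lower()
    if key in stem_names:
        return "STEM"
    if key in humanities_names:
        return "Humanities"
    return "Other"

toy = pd.DataFrame({"name": ["Chemistry", "History", "Dance", "Statistics"]})
toy["group"] = toy["name"].apply(label_group)

# Keep only the subjects that belong to one of the two groups
kept = toy[toy["group"] != "Other"]
print(kept)
```

Using sets for the lookups also makes each membership check constant-time, although with a few hundred subjects the difference is negligible.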

In [49]:
df = df.groupby(['Humanities','STEM']).sum().reset_index()

In [50]:
df['perc_a'] = df['a_count']/df['No_Student']

In [51]:
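# Select each group by its flag rather than by row position
STEM = df[df.STEM == 1]
Humanities = df[df.Humanities == 1]  # note: this reuses the name of the subject list defined earlier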

In [52]:
zeros = np.zeros(shape=(int(Humanities.No_Student.sum() - Humanities.a_count.sum())))
ones = np.ones(shape=(int(Humanities.a_count.sum())))
hum_array = np.append(zeros,ones)
hum_array.mean()

0.3067382051119961

In [53]:
zeros1 = np.zeros(shape=(int(STEM.No_Student.sum() - STEM.a_count.sum())))
ones1 = np.ones(shape=(int(STEM.a_count.sum())))
stem_array = np.append(zeros1,ones1)
stem_array.mean()

0.3252248845725203
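The zeros-and-ones construction in the two cells above can be written more compactly with `np.repeat`. The sketch below is one possible alternative, not the notebook's own code; the helper name `binary_outcomes` is hypothetical, and the counts are the group totals reported in the summary table.

```python
import numpy as np

def binary_outcomes(a_count: int, total: int) -> np.ndarray:
    """Return an array with `total - a_count` zeros followed by `a_count` ones."""
    return np.repeat([0, 1], [total - a_count, a_count])

# Group totals taken from the summary table (a_count, No_Student)
hum_array = binary_outcomes(116455, 379656)
stem_array = binary_outcomes(163419, 502480)

# The mean of each array equals that group's proportion of A grades
print(hum_array.mean(), stem_array.mean())
```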

In [54]:
df

| | Humanities | STEM | a_count | No_Student | perc_a |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 163419.0 | 502480.0 | 0.325225 |
| 1 | 1 | 0 | 116455.0 | 379656.0 | 0.306738 |


- I summed the data across the subjects in each group and added a column, `perc_a`, giving the proportion of A's for each group.
- I then built a binary array for each group (1 for an A grade, 0 for any other grade) so that a t-test could be run on the two samples.
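Because each array contains only zeros and ones, its mean and standard deviation follow directly from the counts, so the test can also be run from summary statistics without materializing the arrays. The sketch below assumes the group totals from the summary table and uses `scipy.stats.ttest_ind_from_stats`; the helper `binary_mean_std` is a hypothetical name introduced for illustration.

```python
import numpy as np
from scipy import stats

def binary_mean_std(a_count: int, total: int):
    """Mean and sample standard deviation (ddof=1) of a 0/1 array built from counts."""
    p = a_count / total
    std = np.sqrt(p * (1 - p) * total / (total - 1))
    return p, std

# Group totals taken from the summary table
hum_mean, hum_std = binary_mean_std(116455, 379656)
stem_mean, stem_std = binary_mean_std(163419, 502480)

# equal_var=False requests Welch's t-test, matching the array-based call
summary_result = stats.ttest_ind_from_stats(
    hum_mean, hum_std, 379656,
    stem_mean, stem_std, 502480,
    equal_var=False,
)
print(summary_result)
```

This should agree with the array-based result up to floating-point error, and it avoids allocating close to a million array elements.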

In [55]:
results = stats.ttest_ind(hum_array, stem_array,equal_var = False)

In [56]:
results

Ttest_indResult(statistic=-18.515814619584372, pvalue=1.5957064858066443e-76)

- This is a two-sided Welch's t-test of the null hypothesis that two independent samples have identical average (expected) values.
- Because the p-value is below our preset alpha of 0.05, we reject the null hypothesis and conclude that the difference in the proportion of A grades between the two groups is statistically significant.

- **In this dataset, STEM subjects awarded a slightly higher share of A grades (about 32.5% versus 30.7%), so a student who wants a higher grade and is choosing between a STEM subject and a Humanities subject would be marginally better placed choosing a STEM subject.**
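With samples this large, even a small gap is statistically significant, so it helps to look at the size of the gap as well as the p-value. The sketch below computes the difference in A-grade proportions and an approximate 95% confidence interval from the group totals in the summary table. It assumes a normal approximation with the conventional 1.96 multiplier, and all variable names are illustrative.

```python
import math

# Group totals taken from the summary table
hum_a, hum_n = 116455, 379656
stem_a, stem_n = 163419, 502480

p_hum = hum_a / hum_n
p_stem = stem_a / stem_n

# Difference in the proportion of A grades (STEM minus Humanities)
diff = p_stem - p_hum

# Unpooled standard error of the difference between two proportions
se = math.sqrt(p_hum * (1 - p_hum) / hum_n + p_stem * (1 - p_stem) / stem_n)

# Approximate 95% confidence interval (normal approximation, assumed here)
z_95 = 1.96
ci_low, ci_high = diff - z_95 * se, diff + z_95 * se
print(f"difference = {diff:.4f}, 95% CI = ({ci_low:.4f}, {ci_high:.4f})")
```

The gap is a little under 2 percentage points, which is worth weighing alongside the p-value when interpreting the result.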