# Prediction

Goal: Predict whether a new coder is female or male based on their university major.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
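# sklearn.cross_validation was removed in scikit-learn 0.20; these helpers now live in model_selection
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score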
from sklearn.ensemble import GradientBoostingClassifier

# Loading Data


In [2]:
all_df = pd.read_csv("2016_New_Coders_Survey.csv", low_memory=False)

In [3]:
# creating a new df with only "female" 
female_df = all_df[all_df["Gender"] == "female"]

# create a new df with only "male"
male_df = all_df[all_df["Gender"] == "male"]

# create a new df for every other response (any other gender value, including missing ones)
lgbtq_plus_df = all_df[all_df["Gender"].apply(lambda x: x not in ["female", "male"])]

In [4]:
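# Make a new dataframe that has only the SchoolMajor and Gender columns
# (.copy() so later assignments modify this frame rather than a view of the original)
all_df = all_df[["SchoolMajor", "Gender"]].copy()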

In [5]:
# Create a set of the top ten majors for each gender group
top_majors_female = set(female_df["SchoolMajor"].value_counts(normalize = True)[0:10].index.values)

In [6]:
top_majors_male = set(male_df["SchoolMajor"].value_counts(normalize = True)[0:10].index.values)
top_majors_male

{'Business Administration',
 'Computer Programming',
 'Computer Science',
 'Economics',
 'Electrical Engineering',
 'Electrical and Electronics Engineering',
 'Engineering',
 'Information Technology',
 'Mechanical Engineering',
 'Software Engineering'}

In [7]:
top_majors_lgbtq_plus = set(lgbtq_plus_df["SchoolMajor"].value_counts(normalize = True)[0:10].index.values)
top_majors_lgbtq_plus

{'Business Administration',
 'Communication and Media Studies',
 'Computer Science',
 'Engineering',
 'English',
 'Information Technology',
 'Math',
 'Music',
 'Philosophy',
 'Psychology'}

In [8]:
# create a set with all three top-ten lists, without repetition, using .union
top_majors = top_majors_female.union(top_majors_male).union(top_majors_lgbtq_plus)

In [9]:
# Relabel every major that is not in the combined top-majors set as "Other".
# The lambda returns True for majors outside the set (including missing values),
# which gives a boolean mask selecting the rows to relabel.
truth_values = all_df["SchoolMajor"].apply(lambda x: x not in top_majors)

# Assign through .loc to avoid chained indexing
all_df.loc[truth_values, "SchoolMajor"] = "Other"
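
The relabeling step above can also be written with `isin`. The following is a minimal sketch on made-up data, not the survey itself: the frame `df` and the set `keep` are stand-ins for `all_df` and `top_majors`.

```python
import pandas as pd

# Toy data standing in for the survey's "SchoolMajor" column (values are illustrative).
df = pd.DataFrame(
    {"SchoolMajor": ["Computer Science", "English", "Botany", None, "Music"]}
)
keep = {"Computer Science", "English", "Music"}  # stand-in for top_majors

# Boolean mask: True where the major is missing or not in the kept set.
is_rare = ~df["SchoolMajor"].isin(keep)

# Assign through .loc so the change is made on df itself.
df.loc[is_rare, "SchoolMajor"] = "Other"

collapsed = df["SchoolMajor"].tolist()
print(collapsed)
```

Both the unlisted major and the missing value end up as "Other", which matches how the lambda-based mask treats them.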

In [10]:
all_df["SchoolMajor"].unique()

array(['Other', 'English', 'Computer Science', 'Business Administration',
       'Mechanical Engineering', 'Math', 'Economics', 'Music',
       'Political Science', 'Information Technology', 'Graphic Design',
       'Liberal Arts', 'Electrical Engineering', 'Biology', 'Engineering',
       'Psychology', 'Philosophy', 'Software Engineering',
       'Computer Programming', 'Electrical and Electronics Engineering',
       'Communication and Media Studies'], dtype=object)

In [11]:
# dummify: one-hot encode the majors with pandas dummy variables
majors_df = pd.get_dummies(all_df['SchoolMajor'])
all_df = pd.concat([majors_df, all_df], axis = 1)

In [12]:
# drop column "SchoolMajor" (axis must be passed by keyword in recent pandas)
all_df = all_df.drop('SchoolMajor', axis=1)

In [13]:
all_df.head()

Unnamed: 0,Biology,Business Administration,Communication and Media Studies,Computer Programming,Computer Science,Economics,Electrical Engineering,Electrical and Electronics Engineering,Engineering,English,...,Liberal Arts,Math,Mechanical Engineering,Music,Other,Philosophy,Political Science,Psychology,Software Engineering,Gender
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,male
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,male
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,male
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,female
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,female
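
To make the encoding step easier to follow, here is a small sketch of the same idea on a made-up four-row frame. The frame `toy` and its values are assumptions for illustration; `dtype=float` is passed only to mirror the 0.0/1.0 values in the table above.

```python
import pandas as pd

# Made-up rows; only the column names match the notebook.
toy = pd.DataFrame({
    "SchoolMajor": ["Math", "Other", "Math", "Music"],
    "Gender": ["female", "male", "male", "female"],
})

# One indicator column per distinct major.
dummies = pd.get_dummies(toy["SchoolMajor"], dtype=float)

# Indicator columns first, Gender kept as the last column.
encoded = pd.concat([dummies, toy[["Gender"]]], axis=1)
print(encoded)
```

Each row has exactly one indicator set to 1.0, and keeping `Gender` last is what lets the later slicing treat the final column as the target.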


In [14]:
# create an array
all_df_arr = all_df.values

In [15]:
# every column except the last holds the one-hot features; the last column is Gender
features = all_df_arr[:, :-1]
target = list(all_df_arr[:, -1])

In [16]:
clf1 = DecisionTreeClassifier(random_state=0)
cross_val_score(clf1, features, target, cv=3)

array([ 0.6891321 ,  0.68836406,  0.68908532])
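
Cross-validation scores like these are easier to interpret next to a baseline that always predicts the most common class. The sketch below shows one way to make that comparison with scikit-learn's `DummyClassifier`. It runs on synthetic data: `X` and `y` are randomly generated stand-ins, not the survey, so the numbers it prints say nothing about the real result.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data: 300 rows, 3 one-hot columns, imbalanced labels.
category = rng.integers(0, 3, size=300)
X = np.eye(3)[category]
# Labels are drawn independently of X, so there is nothing for the tree to learn here.
y = np.where(rng.random(300) < 0.7, "male", "female")

# Baseline: always predict the most frequent class seen in the training folds.
baseline = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=3)
tree = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=3)

print("baseline mean accuracy:", baseline.mean())
print("tree mean accuracy:", tree.mean())
```

Running the same two lines on `features` and `target` would show how much of the accuracy comes from the majors and how much from the class balance alone.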

We can predict with roughly 69% accuracy (three-fold cross-validation) whether a new coder is female or male based only on their college major. This matters because if more men hold degrees related to CS and engineering, they start with a stronger foundation and technical background for a programming career. The gap shows up everywhere, from applications to competitive bootcamps that require a technical background to job interviews where two candidates have studied programming for the same amount of time and have similar skills, but one has a CS-related degree and the other has a degree in English. This puts women who are new coders at a great disadvantage. To increase the number of female programmers, we need to look at this information and find good solutions that level the playing field for this generation of women who already hold a first degree in an unrelated field but want to learn programming and work in high tech.

I plan to add hackathon and bootcamp attendance as my next features to see whether they increase the accuracy. My assumption is that primarily men attend these types of career-building activities, and that we then need to look at how to get more women involved.
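
As a rough sketch of how extra features could sit alongside the major indicators, the example below concatenates two yes/no columns onto the one-hot majors. Everything here is hypothetical: the frame `toy`, its values, and the column names `AttendedHackathon` and `AttendedBootcamp` are placeholders, not the survey's actual fields.

```python
import pandas as pd

# Hypothetical rows and column names, for illustration only.
toy = pd.DataFrame({
    "SchoolMajor": ["Math", "Other", "Music"],
    "AttendedHackathon": [1, 0, 1],
    "AttendedBootcamp": [0, 0, 1],
    "Gender": ["female", "male", "female"],
})

# One-hot encode the major, as in the cells above.
majors = pd.get_dummies(toy["SchoolMajor"], dtype=float)

# Cast the yes/no columns to float so every feature has the same dtype.
extra = toy[["AttendedHackathon", "AttendedBootcamp"]].astype(float)

# Feature matrix: major indicators followed by the two attendance flags.
features = pd.concat([majors, extra], axis=1)
target = toy["Gender"]
print(features)
```

The resulting `features` and `target` can be passed to `cross_val_score` in the same way as before.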