The following notebook/kernel is for [COMP 683 Group B](https://www.kaggle.com/ymdahi/comp-683-group-b-project-proposal/notebook).

# Overview
This notebook applies machine learning concepts to the Stack Overflow Developer Survey data in an effort to analyze and predict factors of **Job Satisfaction**. Specifically, we'll use the data from the 'AssessJob' set of questions to search for patterns in Job Satisfaction.

### Random Forest Classification
"Random Forest" is a trademarked term for an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees.
![](https://i1.wp.com/dataaspirant.com/wp-content/uploads/2017/04/Random-Forest-Introduction.jpg?resize=768%2C384)
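
The "mode of the classes" idea can be illustrated with a minimal sketch. The votes below are hypothetical and `majority_vote` is an illustrative helper, not part of any library. Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes, so this shows the concept rather than that exact implementation.

```python
from collections import Counter

def majority_vote(tree_votes):
    """Return the most common class label among the per-tree votes."""
    return Counter(tree_votes).most_common(1)[0][0]

# Hypothetical votes from five trees for one respondent (ratings 1-7)
votes = [6, 5, 6, 7, 6]
prediction = majority_vote(votes)
print(prediction)
```

Three of the five trees vote for class 6, so that is the class the ensemble would output.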



In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os # operating system interface
import matplotlib.pyplot as plt

# bring in data source
so_results_file_path = '../input/survey_results_public.csv'
#so_schema_file_path = '../input/survey_results_schema.csv'

# create dataframe to hold results data
df = pd.read_csv(so_results_file_path)

df = df.dropna(subset=['JobSatisfaction'])

## 1.0 Columns of Interest
In this section we'll explain how we are going to approach the data that interests us. The first thing to note about this dataset is that the data is largely qualitative. Given this, we will have to look for opportunities that allow us to more effectively and easily model our data.

### 1.1 Job Satisfaction

> **"Overall, how satisfied are you with your job thus far?"**

Survey respondents were given the following options to select from when answering this question:

* Extremely dissatisfied
* Moderately dissatisfied
* Slightly dissatisfied
* Neither satisfied nor dissatisfied
* Slightly satisfied
* Moderately satisfied
* Extremely satisfied

For the purpose of this assignment, we'll convert these 7 qualitative values to integers 1 through 7, where 1 represents 'Extremely dissatisfied' and 7 represents 'Extremely satisfied'. We'll perform this conversion below, after a quick look at how the responses are distributed:


In [None]:
df['JobSatisfaction'].value_counts().plot.pie()

In [None]:
# Create a 'JobSatRating' column from 'JobSatisfaction' that converts the
# qualitative answers to integers 1 (least satisfied) through 7 (most satisfied)
job_sat_scale = {
    'Extremely dissatisfied': 1,
    'Moderately dissatisfied': 2,
    'Slightly dissatisfied': 3,
    'Neither satisfied nor dissatisfied': 4,
    'Slightly satisfied': 5,
    'Moderately satisfied': 6,
    'Extremely satisfied': 7,
}

# Any answer missing from the mapping becomes NaN instead of a 'Failed' string,
# which keeps the column numeric
df['JobSatRating'] = df['JobSatisfaction'].map(job_sat_scale)

df['JobSatRating'].describe()

### 1.2 The AssessJob Question Set
While there is a lot of data that might be of interest to our model, we are going to use the set of questions related to "Assessing a potential job opportunity". The reason for this decision is brevity: if we were to use qualitative data, such as country, degree, favourite frameworks, etc., we would need to code that data into some workable, quantifiable format. While not impossible, it is out of the scope of this assignment.

That said, the question set selected might surprise us with useful insight into how respondents with low or high Job Satisfaction evaluate certain job characteristics.

The question set reads: 

> **"Imagine that you are assessing a potential job opportunity. Please rank the following aspects of the job opportunity in order of importance (by dragging the choices up and down), where 1 is the most important and 10 is the least important"**

The 10 options provided to the user to order include:

1. AssessJob1: The **industry** that I'd be working in
2. AssessJob2: The **financial performance** or funding status of the company or organization
3. AssessJob3: The languages, **frameworks**, and other technologies I'd be working with
4. AssessJob4: The compensation and **benefits** offered
5. AssessJob5: The office environment or company **culture**
6. AssessJob6: The opportunity to work from home/**remotely**
7. AssessJob7: Opportunities for **professional development**
8. AssessJob8: The **diversity** of the company or organization
9. AssessJob9: How widely used or **impactful** the product or service I'd be working on is
10. AssessJob10: **Salary** and/or bonuses

Each of these columns contains a value between 1 and 10 that represents how the respondent ranked the importance of that factor.
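
Because each respondent is asked to order the options, a complete answer should use every rank exactly once. A minimal sketch of that sanity check follows, assuming a hypothetical toy frame with three factors instead of ten; `is_complete_ranking` is an illustrative helper, not something from the survey tooling.

```python
import pandas as pd

# Hypothetical toy data: three respondents ranking three factors (the survey uses ten)
toy = pd.DataFrame({
    'AssessJob1': [1, 2, 3],
    'AssessJob2': [2, 3, 3],
    'AssessJob3': [3, 1, 1],
})

def is_complete_ranking(row):
    """True when the row uses each rank from 1..n exactly once."""
    return sorted(row) == list(range(1, len(row) + 1))

# One boolean per respondent; the third row repeats rank 3 and skips rank 2
valid = toy.apply(is_complete_ranking, axis=1)
print(valid.tolist())
```

Rows that fail a check like this could be inspected or dropped before modelling, though whether such rows exist in the real data is not something this sketch establishes.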

In [None]:
# Columns that we are interested in observing.
columns_of_interest = ['AssessJob1','AssessJob2','AssessJob3','AssessJob4','AssessJob5','AssessJob6','AssessJob7','AssessJob8','AssessJob9','AssessJob10','JobSatisfaction','JobSatRating']

# Drop any rows that do not have complete data for the COI above.
clean_df = df[columns_of_interest].dropna()

# Rename the columns in our COI so they are easier to read.
clean_df.columns = ['Industry', 'FinancialStatus','Frameworks','Benefits','Culture','Remote','PD','Diversity','Impact','Salary','JobSatisfaction','JobSatRating']

# The column we want to predict
target_column = ['JobSatRating']

# The columns we will use to model and make prediction
prediction_columns = ['Industry', 'FinancialStatus','Frameworks','Benefits','Culture','Remote','PD','Diversity','Impact','Salary']

# Let's take a look at our dataframe
clean_df.head()

In [None]:
# Let's take a look at our prediction columns
clean_df[prediction_columns].describe()

In [None]:
# Box Plot for Prediction Columns
clean_df[prediction_columns].boxplot(figsize=(18,10))

In [None]:
clean_df[prediction_columns].hist(figsize=(18,10))

In [None]:
# Count and plot predictors that ranked 3 or lower, i.e. higher importance.
# Summing the boolean mask counts matching rows per prediction column only.
high_importance = (clean_df[prediction_columns] <= 3).sum()
print(high_importance)
high_importance.plot.bar()

In [None]:
# Count and plot predictors that ranked 7 or higher, i.e. lower importance.
# Summing the boolean mask counts matching rows per prediction column only.
low_importance = (clean_df[prediction_columns] >= 7).sum()
print(low_importance)
low_importance.plot.bar()

### Pre-Model Analysis of DataFrame
There are a couple of interesting pieces of information to note:

* Most respondents ranked 'Culture' as the most important factor in a potential job.
* Also ranked higher in priority: 'Benefits' and 'Diversity'.
* Most respondents ranked 'Impact' as the least important factor in a potential job.
* Other lower-ranked factors included 'Financial Status' and 'Industry'.
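
One compact way to cross-check observations like these is to compare the mean rank of each factor, where a lower mean indicates higher importance. The sketch below uses hypothetical toy rankings for three factors, not the survey results.

```python
import pandas as pd

# Hypothetical rankings from four respondents for three factors (1 = most important)
toy = pd.DataFrame({
    'Culture': [1, 1, 2, 1],
    'Impact': [3, 3, 3, 2],
    'Industry': [2, 2, 1, 3],
})

# Lower mean rank means the factor was generally considered more important
mean_rank = toy.mean().sort_values()
print(mean_rank)
```

Applied to `clean_df[prediction_columns]`, the same two calls would order the ten factors from most to least important on average.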

## 2.0 Modelling, Training, and Predictions
We will use a Random Forest Classifier to train on our data and build our model.
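
The train/test split used in this section draws one uniform random number per row and keeps rows at or below 0.75 for training. A minimal sketch of that idea in isolation, assuming a hypothetical row count, shows that the training share is roughly, not exactly, 75%:

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed so the sketch is repeatable
n_rows = 1000                   # hypothetical number of respondents

# Each row independently lands in the training set with probability 0.75
is_train = rng.uniform(0, 1, n_rows) <= 0.75
train_share = is_train.mean()
print(n_rows, is_train.sum(), round(train_share, 3))
```

Because every row is assigned independently, the two sets never overlap, but their sizes vary slightly from seed to seed.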

In [None]:
from sklearn.ensemble import RandomForestClassifier
np.random.seed(0)

# Randomly pick some data to be training data.
clean_df['is_train'] = np.random.uniform(0, 1, len(clean_df)) <= .75

# Create two new dataframes, one with the training rows, one with the test rows
train = clean_df[clean_df['is_train']]
test = clean_df[~clean_df['is_train']]

# Show the number of observations for the test and training dataframes
print('Number of observations in the training data:', len(train))
print('Number of observations in the test data:',len(test))

In [None]:
# Remembering our prediction and target columns
print ('What we want to predict: ')
print (target_column)
print ('Factors we will consider when predicting: ')
print (prediction_columns)

In [None]:
clean_df.head(10)

In [None]:
# Select the training target as a 1-D Series (scikit-learn expects a 1-D y)
y = train['JobSatRating']
y.head()

In [None]:
# Create a random forest Classifier. By convention, clf means 'Classifier'
clf = RandomForestClassifier(n_jobs=2, random_state=0)

# Train the Classifier to take the training features and learn how they relate
# to the training y (the job satisfaction rating)
clf.fit(train[prediction_columns], y)
# Accuracy on the training data itself; this is optimistic and says little about test performance
print(clf.score(train[prediction_columns], y))

In [None]:
# Apply the Classifier we trained to the test data (which, remember, it has never seen before)
print(clf.predict(test[prediction_columns]))

In [None]:
# View the predicted probabilities of the first 10 observations
print(clf.predict_proba(test[prediction_columns])[0:10])

In [None]:
preds = clf.predict(test[prediction_columns])
print('Predictions for first 5 elements in test df:')
print(preds[0:5])
print('Actual values for first 5 elements in test df:')
print(test['JobSatRating'].head())

In [None]:
# Create confusion matrix
cm = pd.crosstab(test['JobSatRating'], preds, rownames=['Actual JobSatisfaction'], colnames=['Predicted JobSatisfaction'])
cm

In [None]:
cm.plot.bar(figsize=(18,10))

In [None]:
cm.plot(kind="bar", figsize=(8,8),stacked=True)

In [None]:
# View a list of the features and their importance scores
imp = list(zip(train[prediction_columns], clf.feature_importances_))
imp

In [None]:
# Per-feature importances from the forest, plus their spread across the individual trees
importances = clf.feature_importances_
std = np.std([tree.feature_importances_ for tree in clf.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(len(prediction_columns)):
    print("%d. %s (%f)" % (f + 1, prediction_columns[indices[f]], importances[indices[f]]))

# Plot the feature importances of the forest, labelled by feature name
plt.figure()
plt.title("Feature importances")
plt.bar(range(len(prediction_columns)), importances[indices],
       color="grey", yerr=std[indices], align="center")
plt.xticks(range(len(prediction_columns)), [prediction_columns[i] for i in indices], rotation=45)
plt.xlim([-1, len(prediction_columns)])
plt.show()

### Links
* https://www.analyticsvidhya.com/blog/2017/09/common-machine-learning-algorithms/
* https://www.analyticsvidhya.com/blog/2016/04/complete-tutorial-tree-based-modeling-scratch-in-python/