The following notebook/kernel is for [COMP 683 Group B](https://www.kaggle.com/ymdahi/comp-683-group-b-project-proposal/notebook).

# Goals and Objectives
This notebook will apply machine learning concepts to the Stack Overflow Developer Survey data in an effort to analyze and predict factors of _career satisfaction_.

Specifically, we will use supervised machine learning algorithms, such as regression and tree-based models, to predict career satisfaction by learning patterns from the Stack Overflow dataset. We will then validate and evaluate these models to describe how well they support decisions about career satisfaction. Where applicable, we will use visualizations of the output to present our findings.

## Getting started
We begin by establishing our machine learning tools, environment, and data sources. 

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os # operating system interface
import matplotlib.pyplot as plt

# bring in data source
so_results_file_path = '../input/survey_results_public.csv'
#so_schema_file_path = '../input/survey_results_schema.csv'

# create dataframe to hold results data
df = pd.read_csv(so_results_file_path)

df = df.dropna(subset=['CareerSatisfaction'])

## Linear Regression Analysis
In our first analysis, we will apply linear regression to model the relationship between our inputs and career satisfaction. This type of model works well for estimating real-valued quantities (e.g. price, cost, quantities).
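
As a minimal sketch of what fitting such a model looks like, the example below uses scikit-learn's `LinearRegression` on synthetic data. The data, the coefficients, and the names `X`, `y`, and `model` are assumptions made for illustration; nothing here comes from the survey.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): y depends linearly on two inputs plus a little noise
rng = np.random.RandomState(0)
X = rng.uniform(1, 10, size=(200, 2))
y = 3.0 + 0.5 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(0, 0.1, size=200)

# Fit an ordinary least squares model
model = LinearRegression()
model.fit(X, y)

print('Intercept:', model.intercept_)
print('Coefficients:', model.coef_)
print('R^2 on the training data:', model.score(X, y))
```

The fitted coefficients should land close to the values used to generate the data, which is the sense in which linear regression "estimates real values".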

## Decision-tree Analysis

## The Shape of the Data
As stated in the intro, we know there are 98,855 respondents and 129 questions.

TODO:
* find number of incomplete entries
* decide whether to exclude these entries so that only complete submissions are used
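
One possible way to approach the first TODO item is sketched below on a small, made-up dataframe. The name `toy` and its values are hypothetical; the same calls should apply to `df`, but that has not been run here.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey dataframe (hypothetical values)
toy = pd.DataFrame({
    'CareerSatisfaction': ['Extremely satisfied', None,
                           'Slightly satisfied', 'Moderately satisfied'],
    'AssessJob1': [1.0, 2.0, np.nan, 4.0],
    'AssessJob2': [3.0, 1.0, 2.0, 5.0],
})

# A row is incomplete if at least one of its columns is missing
incomplete_mask = toy.isna().any(axis=1)
n_incomplete = int(incomplete_mask.sum())

# Keep only the complete submissions
complete_rows = toy[~incomplete_mask]

print('Incomplete entries:', n_incomplete)
print('Complete entries:', len(complete_rows))
print(toy.isna().sum())  # number of missing values per column
```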

In [None]:
print('Number of rows (respondents):', df.shape[0])
print('Number of columns (questions):', df.shape[1])

# 3.0 Columns of Interest

In this section we'll explain how we are going to approach the data that interests us. The first thing to note about this dataset is that the data is largely qualitative. Given this, we will have to look for opportunities that allow us to more effectively and easily model our data.

### 3.1 Career Satisfaction
CareerSatisfaction is the target for our predictions and modelling. It is a qualitative column. The question in the survey for CareerSatisfaction reads:

> **"Overall, how satisfied are you with your career thus far?"**

Survey respondents were given the following options to select from when answering this question:

* Extremely dissatisfied
* Moderately dissatisfied
* Slightly dissatisfied
* Neither satisfied nor dissatisfied
* Slightly satisfied
* Moderately satisfied
* Extremely satisfied

For the purpose of this assignment, we'll convert these 7 qualitative values to integers 1 through 7, where 1 represents 'Extremely dissatisfied' and 7 represents 'Extremely satisfied'. We'll perform this conversion in section 3.1.1, after a quick look at how the answers are distributed:


In [None]:
df['CareerSatisfaction'].value_counts()

In [None]:
df['CareerSatisfaction'].value_counts().plot.pie()

#### 3.1.1 Convert Qualitative Data

In [None]:
# Map each qualitative answer to an integer from 1 (least satisfied) to 7 (most satisfied)
satisfaction_scale = {
    'Extremely dissatisfied': 1,
    'Moderately dissatisfied': 2,
    'Slightly dissatisfied': 3,
    'Neither satisfied nor dissatisfied': 4,
    'Slightly satisfied': 5,
    'Moderately satisfied': 6,
    'Extremely satisfied': 7,
}

# Rows without an answer were dropped earlier, so every value should be in the mapping.
# Any unexpected value becomes NaN instead of a string, which keeps the column numeric.
df['CareerSatRating'] = df['CareerSatisfaction'].map(satisfaction_scale)

df['CareerSatRating'].head(5)


### 3.2 Question: Assessing a potential job opportunity
While there is a lot of data that might be of interest to our model, we are going to use the set of questions related to "Assessing a potential job opportunity". The reason for this decision is brevity: if we were to use qualitative data, such as country, degree, favourite frameworks, etc., we would need to code that data into some workable, quantifiable format. While not impossible, it is out of the scope of this assignment.

That said, the question set selected might surprise us with useful insight into how respondents with low or high Career Satisfaction evaluate certain job characteristics.
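
For reference, the coding of qualitative columns that we set aside above could be as simple as one-hot encoding. The sketch below is only an illustration on a hypothetical toy frame: the `Country` values are made up, and this step is not part of the notebook's pipeline.

```python
import pandas as pd

# Toy frame (hypothetical values) with one qualitative and one numeric column
toy = pd.DataFrame({
    'Country': ['Canada', 'India', 'Canada', 'Germany'],
    'AssessJob1': [1, 4, 2, 3],
})

# Replace the qualitative column with one indicator column per distinct value
encoded = pd.get_dummies(toy, columns=['Country'])

print(encoded.columns.tolist())
print(encoded.shape)
```

Each distinct value becomes its own column, so columns with many distinct values (such as favourite frameworks) would widen the dataframe considerably.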

The question set reads: 

> **"Imagine that you are assessing a potential job opportunity. Please rank the following aspects of the job opportunity in order of importance (by dragging the choices up and down), where 1 is the most important and 10 is the least important"**

The 10 options provided to the user to order include:

* AssessJob1: The industry that I'd be working in
* AssessJob2: The financial performance or funding status of the company or organization
* AssessJob3: The languages, frameworks, and other technologies I'd be working with
* AssessJob4: The compensation and benefits offered
* AssessJob5: The office environment or company culture
* AssessJob6: The opportunity to work from home/remotely
* AssessJob7: Opportunities for professional development
* AssessJob8: The diversity of the company or organization
* AssessJob9: How widely used or impactful the product or service I'd be working on is
* AssessJob10: Salary and/or bonuses

These columns contain a value between 1 and 10 that represents how the respondent ranked the importance of the factor.

### 3.3 Columns of Interest (COI)
We now have our COI selected. A couple of interesting points already:

* The average value for Career Satisfaction is 5.2, or 'Slightly satisfied'
* On average, respondents ranked this factor as having greater importance: AssessJob9: 'How widely used or impactful the product or service I'd be working on is'
* On average, respondents ranked this factor as having lesser importance: AssessJob5: 'The office environment or company culture'
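
Averages like these can be read off the mean of each column. The sketch below shows the idea on made-up ranks for three factors; the values are hypothetical, and the point to keep in mind is that a lower mean rank means higher importance, since 1 is the most important.

```python
import pandas as pd

# Toy ranks (hypothetical): 1 = most important, 3 = least important
ranks = pd.DataFrame({
    'Industry': [1, 2, 1, 1],
    'Salary':   [2, 1, 3, 2],
    'Remote':   [3, 3, 2, 3],
})

# Mean rank per factor, sorted from most to least important
mean_ranks = ranks.mean().sort_values()

print(mean_ranks)
print('Ranked most important on average:', mean_ranks.idxmin())
print('Ranked least important on average:', mean_ranks.idxmax())
```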

In [None]:
columns_of_interest = ['AssessJob1','AssessJob2','AssessJob3','AssessJob4','AssessJob5','AssessJob6','AssessJob7','AssessJob8','AssessJob9','AssessJob10','CareerSatisfaction','CareerSatRating']
selected_columns = df[columns_of_interest].dropna()
selected_columns.columns = ['Industry', 'FinancialStatus','Frameworks','Benefits','Culture','Remote','PD','Diversity','Impact','Salary','CareerSatisfaction','CareerSatRating']
target_column = selected_columns['CareerSatRating']
prediction_columns = ['Industry', 'FinancialStatus','Frameworks','Benefits','Culture','Remote','PD','Diversity','Impact','Salary']
selected_columns[prediction_columns].describe()

In [None]:
selected_columns[prediction_columns].info()

In [None]:
selected_columns[prediction_columns].boxplot(figsize=(18,10))

In [None]:
selected_columns[prediction_columns].hist(figsize=(18,10))

In [None]:
selected_columns.groupby('CareerSatRating').hist(figsize=(18,10))

## 4.0 Modelling, Training, and Predictions

In [None]:
from sklearn.ensemble import RandomForestClassifier
np.random.seed(0)

selected_columns['is_train'] = np.random.uniform(0, 1, len(selected_columns)) <= .75
selected_columns.head(10)

In [None]:
# Create two new dataframes, one with the training rows, one with the test rows
train = selected_columns[selected_columns['is_train']]
test = selected_columns[~selected_columns['is_train']]
# Show the number of observations for the test and training dataframes
print('Number of observations in the training data:', len(train))
print('Number of observations in the test data:',len(test))
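
The random mask above gives roughly, not exactly, a 75/25 split. A common alternative, which this notebook does not use, is scikit-learn's `train_test_split`. The sketch below runs it on a made-up frame; `toy`, `train_toy`, and `test_toy` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame (hypothetical values) with 100 rows
toy = pd.DataFrame({
    'Industry': np.arange(100) % 10 + 1,
    'CareerSatRating': np.arange(100) % 7 + 1,
})

# Hold out 25% of the rows for testing; random_state makes the split repeatable
train_toy, test_toy = train_test_split(toy, test_size=0.25, random_state=0)

print('Training rows:', len(train_toy))
print('Test rows:', len(test_toy))
```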

In [None]:
# Create a list of the feature column's names
features = selected_columns.columns[:10]
features

In [None]:
# train['CareerSatRating'] holds the career satisfaction rating as an integer
# from 1 ('Extremely dissatisfied') to 7 ('Extremely satisfied').
# These ratings are the classes the classifier will learn to predict.
y = train['CareerSatRating']
y.head()

In [None]:
# Create a random forest Classifier. By convention, clf means 'Classifier'
clf = RandomForestClassifier(n_jobs=2, random_state=0)

# Train the Classifier to take the training features and learn how they relate
# to the training y (the career satisfaction rating)
clf.fit(train[features], y)
# Accuracy on the training data itself (an optimistic estimate of performance)
print(clf.score(train[features], y))
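
The score printed above is measured on the same rows the classifier was trained on, so it tends to overstate how well the model generalizes. The sketch below illustrates the gap under deliberately extreme assumptions: the features and labels are random and unrelated to each other, so any accuracy on the training rows comes from memorization. All names and data here are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): labels are independent of the features
rng = np.random.RandomState(0)
X = rng.randint(1, 11, size=(500, 10))  # ten "rank" columns with values 1 to 10
y = rng.randint(1, 8, size=500)         # "ratings" from 1 to 7

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_train, y_train)

train_acc = forest.score(X_train, y_train)  # accuracy on rows seen during training
test_acc = forest.score(X_test, y_test)     # accuracy on held-out rows

print('Training accuracy:', train_acc)
print('Test accuracy:', test_acc)
```

A large difference between the two numbers is a sign of overfitting, which is why accuracy on the held-out rows is the more informative figure.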

In [None]:
# Apply the Classifier we trained to the test data (which, remember, it has never seen before)
print(clf.predict(test[features]))

In [None]:
# View the predicted probabilities of the first 10 observations
print(clf.predict_proba(test[features])[0:10])

In [None]:
preds = clf.predict(test[features])
print('Predictions for first 5 elements in test df:')
print(preds[0:5])
print('Actual values for first 5 elements in test df:')
print(test['CareerSatRating'].head())

In [None]:
# Create confusion matrix
cm = pd.crosstab(test['CareerSatRating'], preds, rownames=['Actual CareerSatisfaction'], colnames=['Predicted CareerSatisfaction'])
cm
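
The crosstab above is one way to build a confusion matrix. As a rough sketch, scikit-learn's `confusion_matrix` and `accuracy_score` produce the same kind of table plus a single accuracy figure. The `actual` and `predicted` lists below are made up for illustration and are not output from our classifier.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up ratings (hypothetical), using only three of the seven classes
actual = [5, 6, 6, 7, 5, 6]
predicted = [6, 6, 6, 7, 5, 5]
labels = [5, 6, 7]

# Rows are actual ratings and columns are predicted ratings, in the order of `labels`
matrix = confusion_matrix(actual, predicted, labels=labels)

# Fraction of predictions that match the actual rating
acc = accuracy_score(actual, predicted)

print(matrix)
print('Accuracy:', acc)
```

Entries on the diagonal count correct predictions; everything off the diagonal is a misclassification.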

In [None]:
cm.plot(kind="bar", figsize=(8,8),stacked=True)

In [None]:
# View a list of the features and their importance scores
imp = list(zip(train[features], clf.feature_importances_))
imp

In [None]:
importances = clf.feature_importances_
importances

In [None]:
std = np.std([tree.feature_importances_ for tree in clf.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(train[features].shape[1]):
    # Show the feature name rather than its positional index
    print("%d. %s (%f)" % (f + 1, features[indices[f]], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(train[features].shape[1]), importances[indices],
       color="r", yerr=std[indices], align="center")
plt.xticks(range(train[features].shape[1]), features[indices], rotation=45)
plt.xlim([-1, train[features].shape[1]])
plt.show()

### Links
* https://www.analyticsvidhya.com/blog/2017/09/common-machine-learning-algorithms/
* https://www.analyticsvidhya.com/blog/2016/04/complete-tutorial-tree-based-modeling-scratch-in-python/