# Part 2

In Part 1 of this assignment, we looked at Linear Regression and built a model from scratch. In practice, we usually do not build models from scratch, but use already available ones. Scikit-Learn provides a vast variety of models. You can have a look at a few different options available here (completely optional): <br>
* Overall Supervised Learning: https://scikit-learn.org/stable/supervised_learning.html <br>
* Classification: https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html <br> 

Feel free to explore other models available in scikit-learn.

### Context

The tech industry has grown tremendously over the past decade, and its advancements have made it a dream workplace for people around the globe. Consequently, a large influx of employees enters the tech industry every year. However, its competitive and fast-paced environment has consequences of its own. It is time to analyze the prevalence of mental health issues faced by employees in tech companies and the factors affecting them.

In this Part, we will use a dataset called **Mental Health in Tech Companies** and check whether we can use features related to the work environment and other work-related factors of a tech employee to predict whether they suffer from mental health issues or not.

In [None]:
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import pyplot
import seaborn as sns

**Question 0 (a):** To make things easier, you have been provided with a cleaned dataset in data_tech.pkl (a pickle file). Read the file into the dataframe data_tech.

**Note:** You may need to further clean and modify the dataframe based on your model/choice of features.  
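As a minimal sketch of how a pickle file round-trips through pandas, the example below writes a toy dataframe to a temporary file and reads it back. The column names and the file name `toy_tech.pkl` are hypothetical and are not taken from the real dataset. Only unpickle files that come from a source you trust.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the real dataset; the column names are hypothetical.
toy = pd.DataFrame({
    "remote_work": ["Yes", "No", "No"],
    "treatment": ["No", "Yes", "Yes"],
})

# Write the toy frame to a temporary pickle file, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_tech.pkl")
    toy.to_pickle(toy_path)
    toy_loaded = pd.read_pickle(toy_path)

print(toy_loaded.head(5))
```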

In [None]:
data_tech = ...
data_tech.head(5)

**Question 0 (b):** Since selecting features is an important part of creating your model, you should have a clear idea about the different columns in the dataset. Read the column descriptions from the file columns.csv into the dataframe columns and understand them. Now analyze them in data_tech using any relevant techniques. You may want to do some additional data cleaning and EDA.

**Hint:** Functions like .info(), .describe(), etc. might be helpful. You can also have a look at the concepts used in Assignment 2/EDA Phase.
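The sketch below shows what these inspection calls return on a small toy dataframe. The columns `age` and `remote_work` are assumptions made for illustration; the real columns are described in columns.csv.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (hypothetical names).
toy = pd.DataFrame({
    "age": [25, 31, 42, 31],
    "remote_work": ["Yes", "No", "No", None],
})

toy.info()                      # column dtypes and non-null counts
summary = toy.describe()        # summary statistics for numeric columns
missing = toy.isna().sum()      # number of missing values per column
counts = toy["remote_work"].value_counts(dropna=False)  # category frequencies

print(summary)
print(missing)
print(counts)
```

Checking missing values and category frequencies early can inform which columns need further cleaning before encoding.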

In [None]:
columns = ...
# Write EDA code here

Now that you have familiarized yourself with the dataset, it is time to create your model!

### Feature Engineering

We want to make use of our features to predict whether an employee in the tech industry would have mental health issues or not. Before we can do that, we need to transform our raw features into new numerical/encoded features that the model can interpret, i.e., we will start with feature engineering.

**Question 1:** Choose your own attributes that you would like to have in your feature set. Encode them into numerical values. This can be an iterative step, so feel free to come back to this question and add/remove features.

**Hint:** You can use any encoding technique, either manually or using libraries like scikit-learn.
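For illustration only, here is a sketch of three common encoding approaches on a toy dataframe: a manual mapping, an ordinal encoder with an explicit category order, and one-hot encoding. The column names and categories are hypothetical; which technique suits which column in data_tech is your decision.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame; column names and categories are assumptions for this sketch.
toy = pd.DataFrame({
    "remote_work": ["Yes", "No", "No", "Yes"],
    "company_size": ["small", "large", "medium", "small"],
})

# Manual encoding of a binary column with an explicit mapping.
toy["remote_work_enc"] = toy["remote_work"].map({"No": 0, "Yes": 1})

# Ordinal encoding with an explicit order: small < medium < large.
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
toy["company_size_enc"] = encoder.fit_transform(toy[["company_size"]]).ravel()

# One-hot encoding, useful when the categories have no natural order.
one_hot = pd.get_dummies(toy["company_size"], prefix="size")

print(toy)
print(one_hot)
```

Ordinal codes imply an order and a distance between categories, which matters for a distance-based model such as KNN, so one-hot encoding may be the safer choice for unordered categories.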

In [None]:
# Enter Code here

**Question 2:** Create a correlation matrix to see how your selected features correlate with each other and with **the target attribute** (the label you want to predict).
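A minimal sketch of a correlation matrix on already-encoded toy columns is given here. The column names are hypothetical, and `label_enc` stands in for whichever column you choose as the label.

```python
import pandas as pd

# Toy encoded features plus a toy label column (all names are hypothetical).
toy = pd.DataFrame({
    "remote_work_enc": [1, 0, 0, 1, 1, 0],
    "family_history_enc": [0, 1, 1, 0, 1, 0],
    "label_enc": [0, 1, 1, 0, 1, 0],
})

corr = toy.corr()  # pairwise Pearson correlation between numeric columns

# Correlation of each feature with the label, strongest positive first.
label_corr = corr["label_enc"].drop("label_enc").sort_values(ascending=False)

print(corr.round(2))
print(label_corr)
```

A heatmap (for example with seaborn's `heatmap` function) is one way to display `corr` visually.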

In [None]:
# Enter Code here

**Question 3:** Next, we want to create our feature set and our label vector. Extract your chosen attributes from the dataframe and create a feature set. Similarly, decide which column is appropriate to act as the label, and create a label vector from it. Give reasoning for your choice of both the feature set and the label vector at the end of the cell as comments. Also comment on the shape of both of them.

In [None]:
# Enter Code here
Feature_set = ...
Label = ...

# Reasoning about Feature Set:

# Reasoning about Label:

# Comments about shape of feature set and label vector

In [None]:
print("The Feature Set of our Model:\n\n",Feature_set)

In [None]:
print("Labels:",Label)

### Creating the Model

Unlike the previous part, this is not a simple linear regression problem. This is a classification problem. Let's apply a **K-Nearest Neighbor Model** to it. Applying the KNN classifier via scikit-learn is similar to applying other models that have been covered in class (e.g., Linear Regression). It is, however, recommended to read up on KNN using the resources mentioned below:

The K-nearest neighbors (KNN) algorithm is a supervised ML algorithm that can be used for both classification and regression problems, although in industry it is mainly used for classification. KNN uses ‘feature similarity’ to predict the values of new data points: a new data point is assigned a value based on how closely it matches the points in the training set.
You can read more about KNN in this documentation, which will also help you build the model ahead:

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
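To make the idea of ‘feature similarity’ concrete, the sketch below predicts the label of one new point by majority vote among its k nearest training points, using Euclidean distance, and compares the result with `KNeighborsClassifier`. The toy data and the helper `knn_predict_one` are hypothetical and written for illustration; this is not how scikit-learn implements KNN internally.

```python
from collections import Counter

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated toy clusters: class 0 near the origin, class 1 near (5, 5).
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])


def knn_predict_one(X_known, y_known, x_new, k=3):
    """Hypothetical helper: majority vote among the k nearest known points."""
    distances = np.linalg.norm(X_known - x_new, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                  # indices of k closest
    votes = Counter(y_known[nearest].tolist())
    return votes.most_common(1)[0][0]


x_new = np.array([4.5, 5.2])
manual_pred = knn_predict_one(X_toy, y_toy, x_new, k=3)

# Same prediction with scikit-learn's classifier.
knn_toy = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)
sklearn_pred = knn_toy.predict(x_new.reshape(1, -1))[0]

print(manual_pred, sklearn_pred)
```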

**Importing Scikit Learn Libraries**

In [None]:
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation
from sklearn.model_selection import train_test_split # For splitting dataset
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, multilabel_confusion_matrix # Evaluation metrics
from sklearn.neighbors import KNeighborsClassifier # KNN Classifier

**Splitting Data**

To set up a model, you first have to split your data into train and test sets. Doing this manually is error-prone and can introduce bias, so scikit-learn provides a built-in function for it: train_test_split <br> Read this documentation to see how to use it: <br> https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
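As a sketch of the call on toy arrays (not the assignment data), the example below holds out 25% of the rows for testing. The `test_size`, `random_state`, and `stratify` values are illustrative assumptions, not recommended settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 2 features, balanced binary labels.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0, 1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.25,    # illustrative: 25% of the rows go to the test set
    random_state=42,   # fixed seed so the split is reproducible
    stratify=y_toy,    # keep class proportions similar in both sets
)

print(X_tr.shape, X_te.shape)
```

Fixing `random_state` makes results comparable across runs, which helps when you come back later to change features.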


**Question 4:** Split the data into train and test sets using a split ratio of your choice. Also, provide a reason for the selected split ratio.

**Hint:** It is a one-liner!

In [None]:
# Split dataset into training set and test set. Choose the test_size parameter yourself. 

X_train = ...
X_test = ...
y_train = ...
y_test = ...

# Reasoning for selected split ratio:

# Printing shapes
print('Train set:', X_train.shape, y_train.shape)
print('Test set:', X_test.shape, y_test.shape)

**Question 5:** Now let's create the KNN classifier. Follow the guidelines below to create, fit, and make predictions using your initial classifier. Refer to the KNN documentation for help. Then calculate the accuracy of the classifier.

In [None]:
# Create KNN Classifier. Keep any value of K for now e.g. 3,5,7...

# Train the model using the training sets

# Predict the response for test dataset
y_pred = ...

We can now calculate the initial accuracy of our model. Remember that we have not optimized any parameters yet.

In [None]:
# Model Accuracy, how often is the classifier correct?
Accuracy = ...
print("Accuracy in Percentage:", Accuracy)

There is no right or wrong answer with respect to accuracy and how high it should be. However, if your accuracy is very low, try iterating on your choice of features, their encoding, etc.

### Optimization

One of the main parameters for KNN is the number of neighbors. We want to find out which value of K minimizes the error rate. To do that, we create models for varying values of K, and plot a graph of Error Rate against Number of Neighbors.

In [None]:
# No changes are needed in this cell. However, you can change the variable names etc.
error_rate = []

for i in range(1,20): 
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,y_train)
    y_pred = knn.predict(X_test)
    error_rate.append(np.mean(y_pred != y_test))

plt.figure(figsize=(8,5))
plt.plot(range(1,20),error_rate,color='blue', linestyle='dashed', marker='o',markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('Number of Neighbors-K')
plt.ylabel('Error Rate')
plt.show()

min_val = 1+error_rate.index(min(error_rate))
print("Minimum error: ",np.around(min(error_rate),decimals=5),"at K =", min_val)

**Question 6:** Now that you know the optimum value for K, create another model using this value. Also predict using X_train (as well as X_test) and store their results in separate variables. We will use that in the next question. <br>

In [None]:
# Create KNN Classifier

# Train the model using the training sets

# Predict the response for Test data

# Predict the response for Train data

### Evaluation Metrics

**Question 7:** The final part of the modelling process is evaluation of the model. There are different metrics that we can use to do so. Scikit-learn has built-in functions for calculating each metric. Look up these functions and print the following:

*   Train Accuracy
*   Test Accuracy
*   F1 Score
*   Precision
*   Recall
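The sketch below shows how these metric functions are called, using small hand-made label lists rather than real model output. The names `y_true_toy` and `y_pred_toy` are hypothetical, and the positive class is assumed to be 1.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hand-made labels: 3 true positives, 3 true negatives,
# 1 false positive, 1 false negative.
y_true_toy = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_toy = [1, 0, 0, 1, 0, 1, 1, 0]

acc = accuracy_score(y_true_toy, y_pred_toy)
prec = precision_score(y_true_toy, y_pred_toy)
rec = recall_score(y_true_toy, y_pred_toy)
f1 = f1_score(y_true_toy, y_pred_toy)

print("Accuracy:", acc)
print("Precision:", prec)
print("Recall:", rec)
print("F1 Score:", f1)
```

Train accuracy uses the same `accuracy_score` call, with the training labels and the predictions made on the training features.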

In [None]:
# Enter Code here

**Question 8:** As a final exercise, we want to summarize our prediction results using a confusion matrix. Create and display a confusion matrix that shows the test predictions. Again, use sklearn's built-in function for the purpose.

**Note:** Do not forget to label the plot properly.
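A minimal sketch of the numeric confusion matrix on hand-made labels follows; the label lists are hypothetical. In scikit-learn's convention, rows correspond to true labels and columns to predicted labels, so with `labels=[0, 1]` the layout is `[[TN, FP], [FN, TP]]`.

```python
from sklearn.metrics import confusion_matrix

# Hand-made labels (hypothetical), with 1 as the positive class.
y_true_toy = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_toy = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()  # unpack in row-major order

print(cm)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```

For a labeled plot, scikit-learn's `ConfusionMatrixDisplay` or a seaborn heatmap of `cm` are two possible options.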

In [None]:
# Enter Code here

**Question 9:** Do the above results validate the accuracy, recall, and precision scores that you received in **Question 7**? <br>Manually calculate the accuracy, precision, and recall of the test predictions. Do not forget to state the formula you used for each.
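As a sketch of the arithmetic involved, the example below applies the standard binary-classification definitions to hypothetical confusion-matrix counts. The counts are made up for illustration and are not results from this dataset.

```python
# Hypothetical counts read off a confusion matrix (not real results).
tn_count, fp_count, fn_count, tp_count = 50, 10, 5, 35

# Accuracy: share of all predictions that are correct.
accuracy_manual = (tp_count + tn_count) / (tp_count + tn_count + fp_count + fn_count)

# Precision: share of predicted positives that are truly positive.
precision_manual = tp_count / (tp_count + fp_count)

# Recall: share of actual positives that were predicted positive.
recall_manual = tp_count / (tp_count + fn_count)

print(round(accuracy_manual, 4), round(precision_manual, 4), round(recall_manual, 4))
```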

Enter Answer here

### Yayy, Good Job!

Congratulations! You have successfully completed your assignment on creating machine learning models. In this part, we saw how implementations are already available in libraries such as scikit-learn, which streamlines the entire process of machine learning. Various parameters can also be tuned in order to improve performance. <br> We hope that you enjoyed this assignment, and have gained some hands-on experience of building machine learning models and classifiers. <br>

This also wraps up CS334's assignments. Good job and all the best for the remaining semester! :')