## **Datasets**

We will use Python's `scikit-learn` to build a classifier on a dataset about data science workers and learners. This dataset has close to three thousand rows and many complex columns. To make it easier to get started, we also provide a smaller dataset with fewer columns. Both datasets are provided as CSV files in the assignment's entry in Canvas. You will need to upload these CSV files to your Google Colab working directory. Once the CSV files are in your working directory, let's load the small CSV file `small_ds_workers_learners.csv` into a pandas DataFrame.

In [1]:
import pandas as pd
survey = pd.read_csv('small_ds_workers_learners.csv', delimiter=',', decimal=",")

Let's gain some basic understanding of the dataset by using `info()`.

In [2]:
survey.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2650 entries, 0 to 2649
Data columns (total 4 columns):
 #   Column                                           Non-Null Count  Dtype 
---  ------                                           --------------  ----- 
 0   Yearly salary                                    1845 non-null   object
 1   Years of experience in machine learning methods  2462 non-null   object
 2   Most frequently used big data products           624 non-null    object
 3   Regularly use Scikit-learn                       1433 non-null   object
dtypes: object(4)
memory usage: 82.9+ KB


We can see that there are null values in every column, since the non-null count of each column is less than 2650. We will make `Yearly salary` the class/prediction attribute. Therefore, let's go ahead and remove the rows with missing values in column `Yearly salary`.

In [3]:
survey = survey[survey['Yearly salary'].notna()]

Now let's find out all distinct values in column `Yearly salary`.

In [4]:
survey['Yearly salary'].unique()

array(['15,000-19,999', '100,000-124,999', '70,000-79,999',
       '300,000-499,999', '200,000-249,999', '125,000-149,999',
       '60,000-69,999', '25,000-29,999', '250,000-299,999',
       '80,000-89,999', '40,000-49,999', '150,000-199,999', '0-999',
       '90,000-99,999', '30,000-39,999', '50,000-59,999', '10,000-14,999',
       '500,000-999,999', '4,000-4,999', '1,000-1,999', '2,000-2,999',
       '>1,000,000', '20,000-24,999', '7,500-9,999', '5,000-7,499',
       '3,000-3,999'], dtype=object)

**The classification task in this assignment is to predict whether a data science worker/learner makes more than \$100K in a year or not, i.e., it is a binary classification task**. Hence, we now replace all salary values less than \$100K with 'No', and replace all other values with 'Yes'.

In [5]:
survey['Yearly salary'] = survey['Yearly salary'].map({'100,000-149,999': 'Yes', '150,000-199,999': 'Yes', '200,000-249,999': 'Yes', '250,000-299,999': 'Yes', '300,000-499,999': 'Yes', '500,000-999,999': 'Yes', '>1,000,000': 'Yes'})
survey.loc[survey['Yearly salary'] != 'Yes', 'Yearly salary'] = 'No'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
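Two details of the cell above are worth a second look. The warning is raised because `survey` was produced by filtering another DataFrame, and the dictionary key `'100,000-149,999'` does not appear among the distinct values listed earlier, which contain `'100,000-124,999'` and `'125,000-149,999'` instead. The following is a minimal sketch of one alternative, not the assignment's own method: it takes an explicit copy after filtering and derives the label from the lower bound of each bracket. The toy DataFrame, the helper `lower_bound`, and the choice to count the bracket starting at 100,000 as 'Yes' are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the survey; the bracket strings are taken from the unique() output.
toy = pd.DataFrame({'Yearly salary': ['15,000-19,999', '100,000-124,999',
                                      '125,000-149,999', '>1,000,000', None,
                                      '90,000-99,999']})

# .copy() yields an independent DataFrame, so the later column assignment
# does not trigger SettingWithCopyWarning.
toy = toy[toy['Yearly salary'].notna()].copy()


def lower_bound(bracket):
    """Hypothetical helper: return the lower end of a salary bracket as an integer."""
    first = bracket.lstrip('>').split('-')[0]
    return int(first.replace(',', ''))


# Assumption: brackets starting at 100,000 or above are labeled 'Yes'.
toy['Yearly salary'] = toy['Yearly salary'].map(
    lambda b: 'Yes' if lower_bound(b) >= 100_000 else 'No')

labels = toy['Yearly salary'].tolist()
print(labels)
```

Deriving the label from the bracket itself avoids having to list every 'Yes' bracket by hand, so a bracket that is spelled differently from what was expected cannot be silently mapped to 'No'.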


Now let's take a look at the first 20 rows after these transformations. 

In [6]:
survey.head(20)

| | Yearly salary | Years of experience in machine learning methods | Most frequently used big data products | Regularly use Scikit-learn |
|---|---|---|---|---|
| 1 | No | 3-4 years | MySQL | Yes |
| 3 | No | 4-5 years | NaN | Yes |
| 4 | No | I do not use machine learning methods | NaN | NaN |
| 5 | Yes | I do not use machine learning methods | MySQL | NaN |
| 6 | Yes | 5-10 years | PostgreSQL | Yes |
| 7 | No | 1-2 years | NaN | Yes |
| 8 | No | 5-10 years | Microsoft SQL Server | NaN |
| 9 | No | Under 1 year | NaN | Yes |
| 10 | No | I do not use machine learning methods | NaN | NaN |
| 11 | No | 4-5 years | NaN | Yes |


## **Data Munging**

From the table above, we see that none of the columns has numeric values. In `scikit-learn`, there are limited ways of building models that work directly with categorical attributes. We need to preprocess these columns before we can build and evaluate models. More specifically, we need to encode these columns as numeric values. The 3 feature columns in this small dataset are different, and we will preprocess each in a different way. In fact, they represent the three types of columns in the larger dataset. Therefore, the following tasks of preprocessing the small dataset will prepare you for working on the larger dataset.

### **1. Binary attribute: `Regularly use Scikit-learn`**

The column `Regularly use Scikit-learn` describes whether a person uses `scikit-learn` on a regular basis. It has two values 'Yes' and NaN (i.e., null value). Based on how the dataset was created, NaN here means 'No'. Let's replace the values in this column with `1` and `0`.

In [7]:
survey.replace(to_replace={'Regularly use Scikit-learn':'Yes'},value=1,inplace=True)

In [8]:
survey.fillna(value={'Regularly use Scikit-learn':0},inplace=True)

In [9]:
survey.head(20)

| | Yearly salary | Years of experience in machine learning methods | Most frequently used big data products | Regularly use Scikit-learn |
|---|---|---|---|---|
| 1 | No | 3-4 years | MySQL | 1.0 |
| 3 | No | 4-5 years | NaN | 1.0 |
| 4 | No | I do not use machine learning methods | NaN | 0.0 |
| 5 | Yes | I do not use machine learning methods | MySQL | 0.0 |
| 6 | Yes | 5-10 years | PostgreSQL | 1.0 |
| 7 | No | 1-2 years | NaN | 1.0 |
| 8 | No | 5-10 years | Microsoft SQL Server | 0.0 |
| 9 | No | Under 1 year | NaN | 1.0 |
| 10 | No | I do not use machine learning methods | NaN | 0.0 |
| 11 | No | 4-5 years | NaN | 1.0 |
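The two steps above (replace 'Yes' with `1`, then fill NaN with `0`) can also be written as a single comparison. This is a small sketch on a toy column, assuming as above that every value other than 'Yes' means 'No'; it also produces integers rather than the floats shown in the table.

```python
import numpy as np
import pandas as pd

# Toy column with the same two kinds of values as the survey column.
toy = pd.DataFrame({'Regularly use Scikit-learn': ['Yes', np.nan, 'Yes', np.nan]})

# eq('Yes') is True for 'Yes' and False for everything else, including NaN;
# astype(int) turns True/False into 1/0.
toy['Regularly use Scikit-learn'] = toy['Regularly use Scikit-learn'].eq('Yes').astype(int)

encoded = toy['Regularly use Scikit-learn'].tolist()
print(encoded)
```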


### **2. Nominal attribute: `Most frequently used big data products`**

The column `Most frequently used big data products` describes the big data product that a person uses most frequently. It has values such as `MySQL`, `PostgreSQL` and so on. Based on what we learned earlier in the semester, this is a nominal attribute in that there isn't a meaningful order among the attribute values. We will use one-hot encoding to represent this attribute. More specifically, we will make one new binary-valued column for each distinct big data product. A row has value `1` or `0` in that new column, based on its value in the original `Most frequently used big data products` column. We performed similar operations in Programming Assignment 2.

Go ahead and apply the following code. After that, the results of `survey.head(20)` show the new columns, each with the prefix `Bigd`. Note that we also dropped the original column `Most frequently used big data products`.


In [10]:
bigd = pd.get_dummies(survey['Most frequently used big data products'], prefix='Bigd')
survey = survey.drop(['Most frequently used big data products'], axis=1)
survey = pd.concat([survey, bigd], axis=1)

In [11]:
survey.head(20)

Unnamed: 0,Yearly salary,Years of experience in machine learning methods,Regularly use Scikit-learn,Bigd_Amazon Aurora,Bigd_Amazon DynamoDB,Bigd_Amazon RDS,Bigd_Amazon Redshift,Bigd_Google Cloud BigQuery,Bigd_Google Cloud BigTable,Bigd_Google Cloud Firestore,Bigd_Google Cloud SQL,Bigd_Google Cloud Spanner,Bigd_IBM Db2,Bigd_Microsoft Azure Cosmos DB,Bigd_Microsoft Azure SQL Database,Bigd_Microsoft SQL Server,Bigd_MongoDB,Bigd_MySQL,Bigd_Oracle Database,Bigd_Other,Bigd_PostgreSQL,Bigd_SQLite,Bigd_Snowflake
1,No,3-4 years,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
3,No,4-5 years,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,No,I do not use machine learning methods,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,Yes,I do not use machine learning methods,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
6,Yes,5-10 years,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
7,No,1-2 years,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,No,5-10 years,0.0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
9,No,Under 1 year,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
10,No,I do not use machine learning methods,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
11,No,4-5 years,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
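Notice that rows with a missing product (for example rows 3 and 4) have `0` in every `Bigd` column. The sketch below reproduces this behavior on a toy column. The `dtype=int` argument is an addition for illustration that requests 0/1 columns explicitly, and `dummy_na=True` is shown only as an optional variant that adds a separate column for missing values.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Most frequently used big data products':
                    ['MySQL', np.nan, 'PostgreSQL', 'MySQL']})

# One 0/1 column per distinct product; the NaN row gets 0 everywhere.
bigd = pd.get_dummies(toy['Most frequently used big data products'],
                      prefix='Bigd', dtype=int)
print(bigd)

# Optional variant: one extra indicator column for the missing values.
bigd_with_na = pd.get_dummies(toy['Most frequently used big data products'],
                              prefix='Bigd', dtype=int, dummy_na=True)
print(bigd_with_na.shape)
```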


### **3. Ordinal attribute: `Years of experience in machine learning methods`**

Let's take a look at the distinct values of column `Years of experience in machine learning methods`. This is an ordinal attribute, since these values capture different levels of experience, from none to abundant experience. Let's map these values into the scale of `1`-`9`. 


In [12]:
survey['Years of experience in machine learning methods'].unique()

array(['3-4 years', '4-5 years', 'I do not use machine learning methods',
       '5-10 years', '1-2 years', 'Under 1 year', '2-3 years', nan,
       '20 or more years', '10-20 years'], dtype=object)

In [13]:
survey['Years of experience in machine learning methods'] = survey['Years of experience in machine learning methods'].map({'I do not use machine learning methods':1,'Under 1 year':2,'1-2 years':3,'2-3 years':4,'3-4 years':5,'4-5 years':6,'5-10 years':7,'10-20 years':8,'20 or more years':9})

The column `Years of experience in machine learning methods` has null values. We are going to replace these null values with `0`. Note that this is not an ideal solution. Given that 0 is less than 1, the classification model we are going to build may pick up the signal that a person having `0` in this column has less experience than a person having `1`, which may not be the case. However, we don't really have a better solution, unless we keep the null values. There are some implementations of learning algorithms in `scikit-learn` that admit null values, and there are other libraries to use. But let's not make things too complicated in this assignment. Let's just replace NaN with `0` in this column.

In [14]:
survey.fillna(value={'Years of experience in machine learning methods':0},inplace=True)

In [15]:
survey.head(20)

Unnamed: 0,Yearly salary,Years of experience in machine learning methods,Regularly use Scikit-learn,Bigd_Amazon Aurora,Bigd_Amazon DynamoDB,Bigd_Amazon RDS,Bigd_Amazon Redshift,Bigd_Google Cloud BigQuery,Bigd_Google Cloud BigTable,Bigd_Google Cloud Firestore,Bigd_Google Cloud SQL,Bigd_Google Cloud Spanner,Bigd_IBM Db2,Bigd_Microsoft Azure Cosmos DB,Bigd_Microsoft Azure SQL Database,Bigd_Microsoft SQL Server,Bigd_MongoDB,Bigd_MySQL,Bigd_Oracle Database,Bigd_Other,Bigd_PostgreSQL,Bigd_SQLite,Bigd_Snowflake
1,No,5.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
3,No,6.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,No,1.0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,Yes,1.0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
6,Yes,7.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
7,No,3.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,No,7.0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
9,No,2.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
10,No,1.0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
11,No,6.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
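The ordinal encoding can also be generated from an ordered list of levels instead of a hand-written dictionary. The sketch below does that on a toy column and, as one possible way to soften the concern about `0` being read as "less experience", adds a separate indicator column that records which values were missing. The column names `ml_years` and `ml_years_missing` are hypothetical, and the indicator column is an illustration rather than part of the assignment.

```python
import numpy as np
import pandas as pd

# Levels in increasing order of experience, as in the mapping above.
levels = ['I do not use machine learning methods', 'Under 1 year', '1-2 years',
          '2-3 years', '3-4 years', '4-5 years', '5-10 years', '10-20 years',
          '20 or more years']
rank = {label: i for i, label in enumerate(levels, start=1)}

toy = pd.DataFrame({'ml_years': ['3-4 years', np.nan, 'Under 1 year']})

# Record which rows were missing before the NaN values are overwritten.
toy['ml_years_missing'] = toy['ml_years'].isna().astype(int)

# Unmapped values (here only NaN) become NaN, which is then replaced by 0.
toy['ml_years'] = toy['ml_years'].map(rank).fillna(0).astype(int)
print(toy)
```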


## **Prepare the Larger Dataset**

Now that we have finished the exercise of preprocessing the smaller dataset, let's get the larger dataset ready. Once again, the CSV files can be found in the assignment's entry in Canvas. You will need to upload these CSV files to your Google Colab working directory. Once the CSV files are in your working directory, let's load the larger CSV file `ds_workers_learners.csv` into a pandas DataFrame.

In [16]:
survey = pd.read_csv('ds_workers_learners.csv', delimiter=',', decimal=",")

In [18]:
# Many transformations are applied in this single cell. Taking an explicit
# copy after filtering keeps pandas from raising SettingWithCopyWarning on
# the column assignments that follow.
import numpy as np
survey = survey[survey['Yearly salary'].notna()].copy()
survey['Yearly salary'] = survey['Yearly salary'].map({'100,000-149,999': 'Yes', '150,000-199,999': 'Yes', '200,000-249,999': 'Yes', '250,000-299,999': 'Yes', '300,000-499,999': 'Yes', '500,000-999,999': 'Yes', '>1,000,000': 'Yes'})
survey.loc[survey['Yearly salary'] != 'Yes', 'Yearly salary'] = 'No'
survey['Age']=survey['Age'].map({'18-21':0,'22-24':1,'25-29':2,'30-34':3,'35-39':4,'40-44':5,'45-49':6,'50-54':7,'55-59':8,'60-69':9,'70+':10})
survey['Degree']=survey['Degree'].map({'prefer not to answer':0,'high school':1,'college study without degree':2,"Bachelor's":3,"Master's":4,'Professional doctorate':5,'Doctoral':6})
survey['Experience with TPU']=survey['Experience with TPU'].map({np.nan:0, 'Never':1, 'Once':2, '2-5 times':3, '6-25 times':4, 'More than 25 times':5})
gend = pd.get_dummies(survey['Gender'], prefix='Gender')
industry = pd.get_dummies(survey['Industry of employer'],prefix='Industry')
bigd = pd.get_dummies(survey['Most frequently used big data products'], prefix='Bigd')
survey.replace(to_replace={'Most frequently used data science platform':'None'},value=np.nan,inplace=True)
platf = pd.get_dummies(survey['Most frequently used data science platform'], prefix='Platform')
tool = pd.get_dummies(survey['Primary tool for analyzing data'], prefix='Tool')
survey['Regularly use Bayesian Approaches'] = survey['Regularly use Bayesian Approaches'].map({np.nan:0,'Yes':1})
survey['Regularly use Convolutional Neural Networks'] = survey['Regularly use Convolutional Neural Networks'].map({np.nan:0,'Yes':1})
survey['Regularly use Decision Trees or Random Forests'] = survey['Regularly use Decision Trees or Random Forests'].map({np.nan:0,'Yes':1})
survey['Regularly use Gradient Boosting Machines'] = survey['Regularly use Gradient Boosting Machines'].map({np.nan:0,'Yes':1})
survey['Regularly use Keras'] = survey['Regularly use Keras'].map({np.nan:0,'Yes':1})
survey['Regularly use Linear or Logistic Regression'] = survey['Regularly use Linear or Logistic Regression'].map({np.nan:0,'Yes':1})
survey['Regularly use Python'] = survey['Regularly use Python'].map({np.nan:0,'Yes':1})
survey['Regularly use PyTorch'] = survey['Regularly use PyTorch'].map({np.nan:0,'Yes':1})
survey['Regularly use R'] = survey['Regularly use R'].map({np.nan:0,'Yes':1})
survey['Regularly use Scikit-learn'] = survey['Regularly use Scikit-learn'].map({np.nan:0,'Yes':1})
survey['Regularly use SQL'] = survey['Regularly use SQL'].map({np.nan:0,'Yes':1})
survey['Regularly use TensorFlow'] = survey['Regularly use TensorFlow'].map({np.nan:0,'Yes':1})
survey['Regularly use Xgboost'] = survey['Regularly use Xgboost'].map({np.nan:0,'Yes':1})
survey['Size of employer'] = survey['Size of employer'].map({np.nan:1,'0-49 employees':1,'50-249 employees':2,'250-999 employees':3,'1000-9,999 employees':4,'10,000 or more employees':5,})
state_emp = pd.get_dummies(survey['State of employer in incorporate machine learning into business'],prefix='State')
title = pd.get_dummies(survey['Title'],prefix='Title')
survey = survey.drop(['Gender','Industry of employer','Most frequently used big data products','Most frequently used data science platform','Primary tool for analyzing data','State of employer in incorporate machine learning into business','Title'], axis=1)
survey = pd.concat([survey, gend, industry, bigd, platf, tool, state_emp, title], axis=1)
survey['Years of coding experience'] = survey['Years of coding experience'].map({'I have never written code':0,'< 1 years':1,'1-3 years':2,'3-5 years':3,'5-10 years':4,'10-20 years':5,'20+ years':6})
survey['Years of experience in machine learning methods'] = survey['Years of experience in machine learning methods'].map({np.nan:0,'I do not use machine learning methods':1,'Under 1 year':2,'1-2 years':3,'2-3 years':4,'3-4 years':5,'4-5 years':6,'5-10 years':7,'10-20 years':8,'20 or more years':9})

In [19]:
survey.head(20)

Unnamed: 0,Age,Degree,Size of employer,Yearly salary,Years of coding experience,Years of experience in machine learning methods,Experience with TPU,Regularly use Python,Regularly use R,Regularly use SQL,Regularly use Scikit-learn,Regularly use TensorFlow,Regularly use Keras,Regularly use PyTorch,Regularly use Xgboost,Regularly use Linear or Logistic Regression,Regularly use Decision Trees or Random Forests,Regularly use Gradient Boosting Machines,Regularly use Bayesian Approaches,Regularly use Convolutional Neural Networks,Gender_Man,Gender_Nonbinary,Gender_Prefer not to say,Gender_Prefer to self-describe,Gender_Woman,Industry_Academics/Education,Industry_Accounting/Finance,Industry_Broadcasting/Communications,Industry_Computers/Technology,Industry_Energy/Mining,Industry_Government/Public Service,Industry_Hospitality/Entertainment/Sports,Industry_Insurance/Risk Assessment,Industry_Manufacturing/Fabrication,Industry_Marketing/CRM,Industry_Medical/Pharmaceutical,Industry_Military/Security/Defense,Industry_Non-profit/Service,Industry_Online Business/Internet-based Sales,Industry_Online Service/Internet-based Services,...,Bigd_Microsoft Azure Cosmos DB,Bigd_Microsoft Azure SQL Database,Bigd_Microsoft SQL Server,Bigd_MongoDB,Bigd_MySQL,Bigd_Oracle Database,Bigd_Other,Bigd_PostgreSQL,Bigd_SQLite,Bigd_Snowflake,Platform_Other,Platform_cloud computing platform,Platform_deep learning workstation,Platform_desktop,Platform_laptop,Tool_Advanced statistical software,Tool_Basic statistical software,Tool_Business intelligence software,Tool_Cloud-based data software & APIs,Tool_Local development environments,Tool_Other,State_I do not know,State_do not use,State_exploring,State_for insights only,State_recently started,State_well established,Title_Business Analyst,Title_DBA/Database Engineer,Title_Data Analyst,Title_Data Engineer,Title_Data Scientist,Title_Developer Relations/Advocacy,Title_Machine Learning Engineer,Title_Other,Title_Product Manager,Title_Program/Project Manager,Title_Research Scientist,Title_Software Engineer,Title_Statistician
1,8,4,1,No,5,5,1,1,0,1,1,1,1,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0
3,3,4,5,No,4,6,1,1,0,1,1,0,0,0,1,1,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0
4,5,3,4,No,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
5,7,4,5,Yes,6,1,4,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0
6,4,5,5,Yes,5,7,3,1,1,1,1,0,0,1,1,0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
7,2,4,4,No,2,3,1,1,0,1,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0
8,8,3,2,No,6,7,2,1,0,1,0,0,0,1,1,1,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
9,3,4,5,No,2,2,1,1,0,1,1,0,0,0,0,1,1,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0
10,8,3,4,No,2,1,1,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
11,5,4,2,No,4,6,3,1,0,0,1,1,1,1,1,1,1,1,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0
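The thirteen `Regularly use ...` columns in cell In [18] all receive the same Yes/NaN to 1/0 mapping, so they could also be encoded in one step. The sketch below shows the idea on a toy DataFrame; selecting the columns by the prefix `'Regularly use '` is an assumption that holds for the column names shown above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Regularly use Python': ['Yes', np.nan, 'Yes'],
    'Regularly use R': [np.nan, np.nan, 'Yes'],
    'Age': ['18-21', '22-24', '25-29'],
})

# Select every binary column by its common prefix.
binary_cols = [c for c in toy.columns if c.startswith('Regularly use ')]

# 'Yes' becomes 1; anything else, including NaN, becomes 0.
toy[binary_cols] = toy[binary_cols].eq('Yes').astype(int)
print(toy)
```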


Make sure to follow the good practices about model selection and model evaluation. For model evaluation: 

1.   Partition the dataset into a training set and a test set. The test set shouldn't be used in any way while training your model. 
2.   Use cross-validation in order to get more robust evaluation results. 
3.   After evaluation, you can train your model again on the whole dataset. Then the trained model can be made available to classify unseen instances in the future. Of course, in this assignment, we don't really have unseen instances to classify. Maybe you can plug in your own information to see how the model predicts, just for fun. 

For model selection: 

1.   Model selection is the step for choosing the optimal model among multiple different types of models (e.g., a decision tree vs. a kNN classifier), or for tuning the hyperparameters (e.g., the maximum depth in a decision tree) in order to get the optimal model within the same family of models. 

2.  In model selection, you further partition the training set (from model evaluation) into train set and validation set.  (Here we call it 'train set', to make it clear it is a subset of the 'training set'.)

3.  Different models are trained using the train set and their performance on the validation set is used to select the best model and/or best hyperparameters. 

4. Model selection itself can also use cross-validation. 
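The practices above can be sketched as follows. This is a minimal illustration on synthetic data generated with `make_classification`, which stands in for the survey features and labels; the parameter grid is likewise an arbitrary example. The point it shows is that cross-validation for model selection runs on the training set only, and the test set is scored once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the survey features and labels (assumption for illustration).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Model evaluation: hold out a test set that stays untouched until the end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# Model selection: 5-fold cross-validation on the training set only.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={'max_depth': [3, 5, 7]}, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Mean cross-validation accuracy: {:.3f}".format(search.best_score_))

# By default GridSearchCV refits the best model on the whole training set,
# so it can be scored directly on the held-out test set.
test_score = search.score(X_test, y_test)
print("Test set accuracy: {:.3f}".format(test_score))
```

With this arrangement, each fold inside `GridSearchCV` plays the role of the validation set, and no row of the test set influences the choice of `max_depth`.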

In [22]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Separate the feature columns from the class attribute once.
features = survey.drop(columns='Yearly salary')
target = survey['Yearly salary']

# One stratified split shared by all three models (the same seed yields the same split).
train_feature, test_feature, train_class, test_class = train_test_split(
    features, target, stratify=target, random_state=42)

models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=5, max_features=45, random_state=0),
    "Logistic Regression": LogisticRegression(random_state=0, max_iter=2000),
    "KNeighbors Classifier": KNeighborsClassifier(n_neighbors=100),
}

for name, model in models.items():
    model.fit(train_feature, train_class)
    # 5-fold cross-validation over the full dataset (cross_val_score fits clones of the model).
    scores = cross_val_score(model, features, target, cv=5)
    print("{}: ".format(name))
    print()
    print("Cross-validation scores: {}".format(scores))
    print("Average cross-validation score: {:.2f}".format(scores.mean()))
    print()
    print("Test set score: {:.3f}".format(model.score(test_feature, test_class)))
    print()
    prediction = model.predict(test_feature)
    print("Confusion matrix:")
    print(pd.crosstab(test_class, prediction, rownames=['True'], colnames=['Predicted'], margins=True))
    print()
    print("Classification report:")
    print(classification_report(test_class, prediction))
    print()

Decision Tree: 

Cross-validation scores: [0.77777778 0.78590786 0.79403794 0.7804878  0.79132791]
Average cross-validation score: 0.79

Test set score: 0.810

Confusion matrix:
Predicted   No  Yes  All
True                    
No         308   31  339
Yes         57   66  123
All        365   97  462

Classification report:
              precision    recall  f1-score   support

          No       0.84      0.91      0.88       339
         Yes       0.68      0.54      0.60       123

    accuracy                           0.81       462
   macro avg       0.76      0.72      0.74       462
weighted avg       0.80      0.81      0.80       462


Logistic Regression: 

Cross-validation scores: [0.81300813 0.81842818 0.82926829 0.79674797 0.78590786]
Average cross-validation score: 0.81

Test set score: 0.827

Confusion matrix:
Predicted   No  Yes  All
True                    
No         307   32  339
Yes         48   75  123
All        355  107  462

Classification report:
            



---
I started with a decision tree because it often performs well across a range of problems. I wanted to use all of the features and let the model select the most important ones. A decision tree can restrict itself to a subset of the features through the `max_features` parameter. I also used `max_depth` to control the accuracy.

Next, because the number of features is large, I expected a linear model to perform better, so I chose logistic regression. The third model is a k-nearest neighbors classifier, which serves as a baseline in addition to the one provided.

While experimenting with model parameters, I found that the k-nearest neighbors classifier was most accurate with 100 neighbors, with similar performance at around 140. The decision tree performed best with `max_depth=5` and `max_features=45`, and logistic regression performed best with its default parameters.

The k-nearest neighbors classifier beat the provided baseline model by 5% in accuracy. Although precision on the No class dropped by 2%, recall on the No class increased by a significant 10%. On the Yes class, precision increased by a significant 12%, with a corresponding trade-off of a 9% drop in recall.

As expected, the decision tree performed better than the k-nearest neighbors classifier, with better recall on the No class and better precision on the Yes class. The difference in accuracy between the two is small, but that additional 1% could matter a lot in production. It is also 6% more accurate than the provided baseline.

As I said earlier, when there are more features than samples, linear models tend to perform better than other models. Even though that is not the case here, 93 features are still a lot. Accordingly, logistic regression proved better than both the decision tree and the k-nearest neighbors classifier. With better precision, recall and overall accuracy, logistic regression gained 2.7% in accuracy over k-nearest neighbors, 1.7% over the decision tree and 7.7% over the provided baseline model.
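For reference, the precision, recall and accuracy figures discussed here can be recomputed from a confusion matrix. The sketch below uses the four counts from the decision tree's confusion matrix shown above; the variable names are chosen for illustration.

```python
# Counts from the decision tree confusion matrix (rows: true class, columns: predicted).
true_no_pred_no, true_no_pred_yes = 308, 31
true_yes_pred_no, true_yes_pred_yes = 57, 66

total = true_no_pred_no + true_no_pred_yes + true_yes_pred_no + true_yes_pred_yes

# Precision: of the rows predicted as a class, how many truly belong to it.
precision_yes = true_yes_pred_yes / (true_yes_pred_yes + true_no_pred_yes)
precision_no = true_no_pred_no / (true_no_pred_no + true_yes_pred_no)

# Recall: of the rows that truly belong to a class, how many were found.
recall_yes = true_yes_pred_yes / (true_yes_pred_yes + true_yes_pred_no)
recall_no = true_no_pred_no / (true_no_pred_no + true_no_pred_yes)

accuracy = (true_no_pred_no + true_yes_pred_yes) / total

print("Yes: precision {:.2f}, recall {:.2f}".format(precision_yes, recall_yes))
print("No:  precision {:.2f}, recall {:.2f}".format(precision_no, recall_no))
print("Accuracy: {:.2f}".format(accuracy))
```

These values agree with the decision tree's classification report (0.68 and 0.54 for Yes, 0.84 and 0.91 for No, 0.81 accuracy).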




---

