
## **Datasets**

In this assignment, we will use Python's `scikit-learn` library to build a classifier on a dataset about data science workers and learners. The dataset has close to three thousand rows and many columns, several of them complex.

In [None]:
import pandas as pd
survey = pd.read_csv('small_ds_workers_learners.csv', delimiter=',', decimal=",")

Let's gain some basic understanding of the dataset by using `info()`.

In [None]:
survey.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2650 entries, 0 to 2649
Data columns (total 4 columns):
 #   Column                                           Non-Null Count  Dtype 
---  ------                                           --------------  ----- 
 0   Yearly salary                                    1845 non-null   object
 1   Years of experience in machine learning methods  2462 non-null   object
 2   Most frequently used big data products           624 non-null    object
 3   Regularly use Scikit-learn                       1433 non-null   object
dtypes: object(4)
memory usage: 82.9+ KB


We can see that there are null values in every column, since the non-null count of each column is less than 2650. We will make `Yearly salary` the class/prediction attribute. Therefore, let's remove the rows with missing values in column `Yearly salary`.

In [None]:
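# .copy() makes the filtered frame independent of the original,
# so that later column assignments do not raise SettingWithCopyWarning.
survey = survey[survey['Yearly salary'].notna()].copy()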
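
The reasoning about null counts can also be checked directly. The following is a minimal sketch on a made-up toy frame (the values are not from the survey), assuming only that missing entries are stored as NaN:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the survey; the values are invented for illustration.
toy = pd.DataFrame({
    "Yearly salary": ["0-999", np.nan, "100,000-124,999", np.nan],
    "Regularly use Scikit-learn": ["Yes", np.nan, np.nan, "Yes"],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Keep only the rows whose class label is present.
labelled = toy.dropna(subset=["Yearly salary"]).copy()
print(len(labelled))
```

`dropna(subset=...)` is an equivalent way of writing the `notna()` filter used in this notebook.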

Now let's find out all distinct values in column `Yearly salary`.

In [None]:
survey['Yearly salary'].unique()

array(['15,000-19,999', '100,000-124,999', '70,000-79,999',
       '300,000-499,999', '200,000-249,999', '125,000-149,999',
       '60,000-69,999', '25,000-29,999', '250,000-299,999',
       '80,000-89,999', '40,000-49,999', '150,000-199,999', '0-999',
       '90,000-99,999', '30,000-39,999', '50,000-59,999', '10,000-14,999',
       '500,000-999,999', '4,000-4,999', '1,000-1,999', '2,000-2,999',
       '>1,000,000', '20,000-24,999', '7,500-9,999', '5,000-7,499',
       '3,000-3,999'], dtype=object)

**The classification task in this assignment is to predict whether a data science worker/learner makes more than \$100K in a year or not, i.e., it is a binary classification task**. Hence, we now replace all salary values less than \$100K with 'No', and replace all other values with 'Yes'.

In [None]:
survey['Yearly salary'] = survey['Yearly salary'].map({'100,000-124,999': 'Yes', '125,000-149,999': 'Yes', '150,000-199,999': 'Yes', '200,000-249,999': 'Yes', '250,000-299,999': 'Yes', '300,000-499,999': 'Yes', '500,000-999,999': 'Yes', '>1,000,000': 'Yes'})
survey.loc[survey['Yearly salary'] != 'Yes', 'Yearly salary'] = 'No'
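
An alternative to listing every salary bracket by hand is to parse the lower bound of each bracket and compare it with 100,000. This is only a sketch: the helper name `earns_100k_or_more` is hypothetical, and it assumes every bracket looks like the strings returned by `unique()` above.

```python
import pandas as pd

def earns_100k_or_more(salary_range: str) -> str:
    """Return 'Yes' if the lower bound of a salary bracket is at least 100,000."""
    # '>1,000,000' has no dash; strip the '>' and treat the number as the lower bound.
    lower = salary_range.lstrip(">").split("-")[0].replace(",", "")
    return "Yes" if int(lower) >= 100_000 else "No"

brackets = pd.Series(["15,000-19,999", "100,000-124,999", "125,000-149,999",
                      "90,000-99,999", ">1,000,000", "0-999"])
labels = brackets.map(earns_100k_or_more)
print(labels.tolist())  # ['No', 'Yes', 'Yes', 'No', 'Yes', 'No']
```

Because the label is computed from the number itself, a bracket cannot be mislabelled by a misspelled dictionary key.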



Now let's take a look at the first 20 rows after these transformations. 

In [None]:
survey.head(20)

Unnamed: 0,Yearly salary,Years of experience in machine learning methods,Most frequently used big data products,Regularly use Scikit-learn
1,No,3-4 years,MySQL,Yes
3,No,4-5 years,,Yes
4,No,I do not use machine learning methods,,
5,Yes,I do not use machine learning methods,MySQL,
6,Yes,5-10 years,PostgreSQL,Yes
7,No,1-2 years,,Yes
8,No,5-10 years,Microsoft SQL Server,
9,No,Under 1 year,,Yes
10,No,I do not use machine learning methods,,
11,No,4-5 years,,Yes


## **Data Munging**

From the table above, we see that none of the columns has numeric values. In `scikit-learn`, there are limited ways of building models that work directly with categorical attributes. We need to preprocess these columns before we can build and evaluate models. More specifically, we need to encode these columns as numeric values. The 3 feature columns in this small dataset are different, and we will pre-process each in a different way. In fact, they represent the three types of columns in the larger dataset. Therefore, pre-processing the small dataset will prepare us for working on the larger dataset.

### **1. Binary attribute: `Regularly use Scikit-learn`**

The column `Regularly use Scikit-learn` describes whether a person uses `scikit-learn` on a regular basis. It has two values 'Yes' and NaN (i.e., null value). Based on how the dataset was created, NaN here means 'No'. Let's replace the values in this column with `1` and `0`.

## **Task 1: In column `Regularly use Scikit-learn`, replace 'Yes' by `1`.** 


In [None]:
# Code for Task 1
survey['Regularly use Scikit-learn'] = survey['Regularly use Scikit-learn'].map({'Yes':1})

## **Task 2: In column `Regularly use Scikit-learn`, replace NaN by `0`.** 

In [None]:
# Code for Task 2
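# Assign the result back; calling fillna(inplace=True) on a selected column
# is not guaranteed to modify the DataFrame in recent pandas versions.
survey['Regularly use Scikit-learn'] = survey['Regularly use Scikit-learn'].fillna(0)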
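
Tasks 1 and 2 can also be combined into one step. The sketch below uses a made-up series and assumes, as stated above, that the only non-null value in the column is 'Yes':

```python
import numpy as np
import pandas as pd

uses_sklearn = pd.Series(["Yes", np.nan, "Yes", np.nan])

# notna() is True where the respondent answered 'Yes'; astype(int) turns that into 1/0.
encoded = uses_sklearn.notna().astype(int)
print(encoded.tolist())  # [1, 0, 1, 0]
```

This version yields integers rather than the floats produced by `map` followed by `fillna`.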

In [None]:
survey.head(20)

Unnamed: 0,Yearly salary,Years of experience in machine learning methods,Most frequently used big data products,Regularly use Scikit-learn
1,No,3-4 years,MySQL,1.0
3,No,4-5 years,,1.0
4,No,I do not use machine learning methods,,0.0
5,Yes,I do not use machine learning methods,MySQL,0.0
6,Yes,5-10 years,PostgreSQL,1.0
7,No,1-2 years,,1.0
8,No,5-10 years,Microsoft SQL Server,0.0
9,No,Under 1 year,,1.0
10,No,I do not use machine learning methods,,0.0
11,No,4-5 years,,1.0


### **2. Nominal attribute: `Most frequently used big data products`**

The column `Most frequently used big data products` describes the big data product that a person uses most frequently. It has values such as `MySQL`, `PostgreSQL` and so on. Based on what we learned earlier in the semester, this is a nominal attribute in that there isn't a meaningful order among the attribute values. We will use one-hot encoding to represent this attribute. More specifically, we will make one new binary-value column for each distinct big data product. A row has value `1` or `0` in that new column, based on its value in the original `Most frequently used big data products` column. 

In [None]:
bigd = pd.get_dummies(survey['Most frequently used big data products'], prefix='Bigd')
survey = survey.drop(['Most frequently used big data products'], axis=1)
survey = pd.concat([survey, bigd], axis=1)
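
To see what one-hot encoding does with missing values, here is a small sketch on made-up data. The `dtype=int` argument is an addition for illustration; recent pandas versions return boolean columns by default.

```python
import numpy as np
import pandas as pd

products = pd.Series(["MySQL", np.nan, "PostgreSQL", "MySQL"], name="product")

# One indicator column per distinct product; dtype=int gives 0/1 instead of booleans.
dummies = pd.get_dummies(products, prefix="Bigd", dtype=int)
print(dummies)

# The row with a missing product has 0 in every indicator column.
row_sums = dummies.sum(axis=1).tolist()
print(row_sums)  # [1, 0, 1, 1]
```

A row with a null product is therefore encoded as all zeros, which is why no separate handling of NaN is needed for this column.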

In [None]:
survey.head(20)

Unnamed: 0,Yearly salary,Years of experience in machine learning methods,Regularly use Scikit-learn,Bigd_Amazon Aurora,Bigd_Amazon DynamoDB,Bigd_Amazon RDS,Bigd_Amazon Redshift,Bigd_Google Cloud BigQuery,Bigd_Google Cloud BigTable,Bigd_Google Cloud Firestore,Bigd_Google Cloud SQL,Bigd_Google Cloud Spanner,Bigd_IBM Db2,Bigd_Microsoft Azure Cosmos DB,Bigd_Microsoft Azure SQL Database,Bigd_Microsoft SQL Server,Bigd_MongoDB,Bigd_MySQL,Bigd_Oracle Database,Bigd_Other,Bigd_PostgreSQL,Bigd_SQLite,Bigd_Snowflake
1,No,3-4 years,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
3,No,4-5 years,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,No,I do not use machine learning methods,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,Yes,I do not use machine learning methods,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
6,Yes,5-10 years,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
7,No,1-2 years,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,No,5-10 years,0.0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
9,No,Under 1 year,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
10,No,I do not use machine learning methods,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
11,No,4-5 years,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


### **3. Ordinal attribute: `Years of experience in machine learning methods`**

Let's take a look at the distinct values of column `Years of experience in machine learning methods`. This is an ordinal attribute, since these values capture different levels of experience, from none to abundant experience. Let's map these values into the scale of `1`-`9`. 


In [None]:
survey['Years of experience in machine learning methods'].unique()

array(['3-4 years', '4-5 years', 'I do not use machine learning methods',
       '5-10 years', '1-2 years', 'Under 1 year', '2-3 years', nan,
       '20 or more years', '10-20 years'], dtype=object)

## **Task 3: In column `Years of experience in machine learning methods`, replace column values by `1` - `9` --- 'I do not use machine learning methods' by numeric value `1`, 'Under 1 year' by numeric value `2`, ..., and '20 or more years' by numeric value `9`.**

In [None]:
# Code for Task 3
survey['Years of experience in machine learning methods'] = survey['Years of experience in machine learning methods'].map({'I do not use machine learning methods':1, 
                                                               'Under 1 year':2, 
                                                               '1-2 years':3, 
                                                               '2-3 years':4, 
                                                               '3-4 years':5,
                                                               '4-5 years':6,
                                                               '5-10 years':7,
                                                               '10-20 years':8,
                                                               '20 or more years':9})

The column `Years of experience in machine learning methods` has null values. We are going to replace these null values by `0`. Note that this is not an ideal solution. Given that 0 is less than 1, the classification model we are going to build may pick up the signal that a person having `0` in this column has less experience than a person having `1`, which may not be the case. However, we don't really have a better solution, unless we keep the null values. Some implementations of learning algorithms in `scikit-learn` accept null values, and there are other libraries that do as well. But let's just replace NaN by `0` in this column.

## **Task 4: In column `Years of experience in machine learning methods`, replace `NaN` by `0`.**

In [None]:
# Code for Task 4
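# Assign the result back rather than relying on inplace=True on a selected column.
survey['Years of experience in machine learning methods'] = survey['Years of experience in machine learning methods'].fillna(0)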
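
The same 0 to 9 encoding can be produced from a single ordered list of levels, which avoids skipping a level when typing the dictionary. This is a sketch on a made-up series, using an ordered `pd.Categorical` as a swapped-in technique rather than the `map` call above:

```python
import numpy as np
import pandas as pd

# Levels from least to most experience, as listed in Task 3.
levels = ["I do not use machine learning methods", "Under 1 year", "1-2 years",
          "2-3 years", "3-4 years", "4-5 years", "5-10 years", "10-20 years",
          "20 or more years"]

experience = pd.Series(["3-4 years", np.nan, "Under 1 year", "20 or more years"])

# Codes run 0..8 in the order of `levels`; missing values get code -1.
codes = pd.Categorical(experience, categories=levels, ordered=True).codes

# Shift by one so the levels become 1..9 and missing values become 0.
encoded = pd.Series(codes, index=experience.index).astype(int) + 1
print(encoded.tolist())  # [5, 0, 2, 9]
```

Any value that is not in `levels` is also coded as 0, so a misspelled level shows up as an unexpected 0 rather than silently disappearing.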

In [None]:
survey.head(20)

Unnamed: 0,Yearly salary,Years of experience in machine learning methods,Regularly use Scikit-learn,Bigd_Amazon Aurora,Bigd_Amazon DynamoDB,Bigd_Amazon RDS,Bigd_Amazon Redshift,Bigd_Google Cloud BigQuery,Bigd_Google Cloud BigTable,Bigd_Google Cloud Firestore,Bigd_Google Cloud SQL,Bigd_Google Cloud Spanner,Bigd_IBM Db2,Bigd_Microsoft Azure Cosmos DB,Bigd_Microsoft Azure SQL Database,Bigd_Microsoft SQL Server,Bigd_MongoDB,Bigd_MySQL,Bigd_Oracle Database,Bigd_Other,Bigd_PostgreSQL,Bigd_SQLite,Bigd_Snowflake
1,No,5.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
3,No,6.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,No,1.0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,Yes,1.0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
6,Yes,7.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
7,No,3.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,No,7.0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
9,No,2.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
10,No,1.0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
11,No,6.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


## **Prepare the Larger Dataset**

Now that we have finished the exercise of pre-processing the smaller dataset, let's get the larger dataset ready.

In [None]:
survey = pd.read_csv('ds_workers_learners.csv', delimiter=',', decimal=",")

## **Task 5: Pre-process the larger dataset.**

The larger dataset has many more columns than the smaller one. However, each of them falls into one of the 3 types of columns we explained earlier.

In [None]:
#survey.info()
#survey['Years of coding experience'].unique()
#survey['Age'].unique()
#survey['Degree'].unique()
#survey['Size of employer'].unique()
#survey['Yearly salary'].unique()
#survey['Years of experience in machine learning methods'].unique()
#survey['Experience with TPU'].unique()

In [None]:
# Code for Task 5

survey = survey.replace(['Yes'],1)
survey.fillna(0, inplace=True)
#------------------------------------------------------------------------------------------------------------------------------------
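# Keys must match the column values exactly (no trailing whitespace),
# otherwise the level is left unmapped and ends up as 0.
survey['Years of coding experience'] = survey['Years of coding experience'].map({'I have never written code':1, 
                                                                                 '< 1 years':2, 
                                                                                 '1-3 years':3, 
                                                                                 '3-5 years':4, 
                                                                                 '5-10 years':5,
                                                                                 '10-20 years':6,
                                                                                 '20+ years':7})
survey['Years of coding experience'] = survey['Years of coding experience'].fillna(0)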
#------------------------------------------------------------------------------------------------------------------------------------
survey['Age'] = survey['Age'].map({'18-21':1, 
                                   '22-24':2,
                                   '25-29':3,
                                   '30-34':4,
                                   '35-39':5,
                                   '40-44':6,
                                   '45-49':7,
                                   '50-54':8,
                                   '55-59':9,   # assumes the data contains a '55-59' bracket
                                   '60-69':10,
                                   '70+':11})
survey['Age'] = survey['Age'].fillna(0)
#------------------------------------------------------------------------------------------------------------------------------------
survey['Size of employer'] = survey['Size of employer'].map({'0-49 employees': 1,
                                                             '50-249 employees': 2,
                                                             '250-999 employees': 3,
                                                             '1000-9,999 employees': 4,
                                                             '10,000 or more employees': 5})
survey['Size of employer'] = survey['Size of employer'].fillna(0)
#------------------------------------------------------------------------------------------------------------------------------------
survey['Years of experience in machine learning methods'] = survey['Years of experience in machine learning methods'].map({'I do not use machine learning methods':1,
                                                                                                                           'Under 1 year': 2,
                                                                                                                           '1-2 years': 3,
                                                                                                                           '2-3 years': 4,
                                                                                                                           '3-4 years': 5,
                                                                                                                           '4-5 years': 6,
                                                                                                                           '5-10 years': 7,
                                                                                                                           '10-20 years': 8,
                                                                                                                           '20 or more years': 9})
survey['Years of experience in machine learning methods'] = survey['Years of experience in machine learning methods'].fillna(0)
#------------------------------------------------------------------------------------------------------------------------------------
survey['Experience with TPU'] = survey['Experience with TPU'].map({'Never': 1,
                                                                   'Once': 2,
                                                                   '2-5 times': 3,
                                                                   '6-25 times': 4,
                                                                   'More than 25 times': 5})
survey['Experience with TPU'] = survey['Experience with TPU'].fillna(0)
#------------------------------------------------------------------------------------------------------------------------------------
survey['Degree'] = survey['Degree'].map({'prefer not to answer': 1,
                                         'high school': 2,
                                         'college study without degree': 3,
                                         "Bachelor's": 4,
                                         "Master's": 5,
                                         'Doctoral': 6,
                                         'Professional doctorate': 7})
survey['Degree'] = survey['Degree'].fillna(0)
#------------------------------------------------------------------------------------------------------------------------------------
survey['Yearly salary'] = survey['Yearly salary'].map({'100,000-124,999': 'Yes',
                                                       '125,000-149,999': 'Yes',
                                                       '150,000-199,999': 'Yes',
                                                       '200,000-249,999': 'Yes',
                                                       '250,000-299,999': 'Yes',
                                                       '300,000-499,999': 'Yes',
                                                       '500,000-999,999': 'Yes',
                                                       '>1,000,000': 'Yes'})
survey.loc[survey['Yearly salary'] != 'Yes', 'Yearly salary'] = 'No'
#------------------------------------------------------------------------------------------------------------------------------------
survey = pd.get_dummies(survey, columns=['Title', 
                                'Gender',
                                'Industry of employer', 
                                'State of employer in incorporate machine learning into business', 
                                'Most frequently used data science platform', 
                                'Most frequently used big data products', 
                                'Primary tool for analyzing data'], prefix=["Title", "Gender", "Industry", "State", "Platform", "Bigd", "Tool"])

survey = survey.drop(['Title_Currently not employed', 'Title_Student', 'Industry_0', 'State_0', 'Platform_0', 'Platform_None', 'Bigd_0', 'Tool_0'], axis = 1)

In [None]:
survey.head(20)

Unnamed: 0,Age,Degree,Size of employer,Yearly salary,Years of coding experience,Years of experience in machine learning methods,Experience with TPU,Regularly use Python,Regularly use R,Regularly use SQL,Regularly use Scikit-learn,Regularly use TensorFlow,Regularly use Keras,Regularly use PyTorch,Regularly use Xgboost,Regularly use Linear or Logistic Regression,Regularly use Decision Trees or Random Forests,Regularly use Gradient Boosting Machines,Regularly use Bayesian Approaches,Regularly use Convolutional Neural Networks,Title_Business Analyst,Title_DBA/Database Engineer,Title_Data Analyst,Title_Data Engineer,Title_Data Scientist,Title_Developer Relations/Advocacy,Title_Machine Learning Engineer,Title_Other,Title_Product Manager,Title_Program/Project Manager,Title_Research Scientist,Title_Software Engineer,Title_Statistician,Gender_Man,Gender_Nonbinary,Gender_Prefer not to say,Gender_Prefer to self-describe,Gender_Woman,Industry_Academics/Education,Industry_Accounting/Finance,...,Industry_Other,Industry_Retail/Sales,Industry_Shipping/Transportation,State_I do not know,State_do not use,State_exploring,State_for insights only,State_recently started,State_well established,Platform_Other,Platform_cloud computing platform,Platform_deep learning workstation,Platform_desktop,Platform_laptop,Bigd_Amazon Aurora,Bigd_Amazon DynamoDB,Bigd_Amazon RDS,Bigd_Amazon Redshift,Bigd_Google Cloud BigQuery,Bigd_Google Cloud BigTable,Bigd_Google Cloud Firestore,Bigd_Google Cloud SQL,Bigd_Google Cloud Spanner,Bigd_IBM Db2,Bigd_Microsoft Azure Cosmos DB,Bigd_Microsoft Azure SQL Database,Bigd_Microsoft SQL Server,Bigd_MongoDB,Bigd_MySQL,Bigd_Oracle Database,Bigd_Other,Bigd_PostgreSQL,Bigd_SQLite,Bigd_Snowflake,Tool_Advanced statistical software,Tool_Basic statistical software,Tool_Business intelligence software,Tool_Cloud-based data software & APIs,Tool_Local development environments,Tool_Other
0,2.0,3,3.0,No,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0.0,5,1.0,No,6.0,4.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0
2,1.0,3,0.0,No,3.0,2.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,4.0,5,5.0,Yes,5.0,5.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,...,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
4,6.0,4,4.0,No,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
5,8.0,5,5.0,Yes,0.0,1.0,4.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0
6,5.0,7,5.0,Yes,6.0,6.0,3.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0
7,3.0,5,4.0,No,3.0,3.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
8,0.0,4,2.0,Yes,0.0,6.0,2.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,...,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0
9,4.0,5,5.0,No,3.0,2.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0


In [None]:
#for col_name in survey.columns: 
#    print(col_name)

## **Load Pre-processed Dataset**


In [None]:
survey = pd.read_csv('p3_processed.csv', delimiter=',')

In [None]:
#for col_name in survey.columns: 
#    print(col_name)

## **Task 6: Build and evaluate classification models.**

We can choose which feature columns to include in building the model, and we can tune the model by using any combination of parameter values.

For model evaluation: 

1.   Partition the dataset into a training set and a test set. The test set shouldn't be used in any way while training the model. 
2.   Use cross-validation in order to get more robust evaluation results. 
3.   After evaluation, we can train the model again on the whole dataset. Then the trained model can be made available to classify unseen instances in the future. Of course, in this assignment, we don't really have unseen instances to apply it to. 
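
The three evaluation steps above can be sketched as follows. The data is synthetic (`make_classification` stands in for the survey features), and the decision tree with `max_depth=4` is just one possible model:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the survey features and labels.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Step 1: hold out a test set (25% by default), stratified on the class label.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=20)

model = DecisionTreeClassifier(max_depth=4, random_state=0)

# Step 2: cross-validate on the training portion only, so the test set stays unseen.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("Mean cross-validation accuracy: {:.3f}".format(cv_scores.mean()))

# Fit on the full training set and score once on the held-out test set.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print("Test set accuracy: {:.3f}".format(test_accuracy))

# Step 3: after evaluation, refit on all available data for future use.
final_model = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
```

Running cross-validation on `X_train` rather than on the whole dataset keeps the test rows out of every training fold.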

For model selection: 

1.   Model selection is the step for choosing the optimal model among multiple different types of models (e.g., a decision tree vs. a kNN classifier), or for tuning the hyperparameters (e.g., the maximum depth in a decision tree) in order to get the optimal model within the same family of models. 

2.  In model selection, we further partition the training set (from model evaluation) into train set and validation set.  (Here we call it 'train set', to make it clear it is a subset of the 'training set'.)

3.  Different models are trained using the train set and their performance on the validation set is used to select the best model and/or best hyperparameters. 

4. Model selection itself can also use cross-validation. 
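
As an illustration of model selection with cross-validation, the sketch below tunes `n_neighbors` for a kNN classifier with `GridSearchCV`. The data is synthetic and the candidate values in `param_grid` are arbitrary choices for the example:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the survey features and labels.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=20)

# Candidate hyperparameter values (arbitrary for this example).
param_grid = {"n_neighbors": [2, 4, 8, 16]}

# Each candidate is scored by 5-fold cross-validation inside the training set,
# so every fold plays the role of the validation set once.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
# The test set is used only once, to evaluate the selected model.
test_accuracy = search.score(X_test, y_test)
print("Test set accuracy: {:.3f}".format(test_accuracy))
```

By default `GridSearchCV` refits the best candidate on the whole training set, so `search` can be used directly for prediction afterwards.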

In [None]:
#Libraries required to work with
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

In [None]:
# Code for Task 6
#x_data = pd.DataFrame(survey, columns=['Degree', 'Years of coding experience', 'Years of experience in machine learning methods'])
x_data = pd.DataFrame(survey.drop(['Yearly salary'], axis=1))

survey['Yearly salary'] = survey['Yearly salary'].replace({'Yes': 1, 'No': 0})
y_target = survey['Yearly salary'].values

In [None]:
#To randomly partition data
train_feature, test_feature, train_class, test_class = train_test_split(
    x_data, y_target, stratify=y_target, random_state=20)

In [None]:
knn = KNeighborsClassifier(n_neighbors=4)
print(knn.fit(train_feature, train_class), "\n")
print("Training set score: {:.3f}".format(knn.score(train_feature, train_class)), "\n")
print("Test set accuracy: {:.2f}".format(knn.score(test_feature, test_class)), "\n")

prediction = knn.predict(test_feature)
print("Confusion matrix:")
print(pd.crosstab(test_class, prediction, rownames=['True'], colnames=['Predicted'], margins=True))
print()
print("Classification report:")
print(classification_report(test_class, prediction))

scores = cross_val_score(knn, x_data, y_target, cv=25)
print("Cross-validation scores: {}".format(scores))
print("Average cross-validation score: {:.2f}".format(scores.mean()))

KNeighborsClassifier(n_neighbors=4) 

Training set score: 0.844 

Test set accuracy: 0.77 

Confusion matrix:
Predicted    0   1  All
True                   
0          317  22  339
1           85  38  123
All        402  60  462

Classification report:
              precision    recall  f1-score   support

           0       0.79      0.94      0.86       339
           1       0.63      0.31      0.42       123

    accuracy                           0.77       462
   macro avg       0.71      0.62      0.64       462
weighted avg       0.75      0.77      0.74       462

Cross-validation scores: [0.72972973 0.78378378 0.75675676 0.77027027 0.74324324 0.68918919
 0.75675676 0.78378378 0.7972973  0.82432432 0.78378378 0.74324324
 0.77027027 0.7027027  0.7972973  0.74324324 0.75675676 0.77027027
 0.78378378 0.72972973 0.79452055 0.7260274  0.7260274  0.73972603
 0.73972603]
Average cross-validation score: 0.76


In [None]:
linearsvm = LinearSVC(random_state=0, tol=1e-1, max_iter=5000).fit(train_feature, train_class)
print("Training set score: {:.3f}".format(linearsvm.score(train_feature, train_class)), "\n")
print("Test set score: {:.3f}".format(linearsvm.score(test_feature, test_class)), "\n")

prediction = linearsvm.predict(test_feature)
print("Confusion matrix:")
print(pd.crosstab(test_class, prediction, rownames=['True'], colnames=['Predicted'], margins=True))
print()
print("Classification report:")
print(classification_report(test_class, prediction))

scores = cross_val_score(linearsvm, x_data, y_target, cv=15)
print("Cross-validation scores: {}".format(scores))
print("Average cross-validation score: {:.2f}".format(scores.mean()))

Training set score: 0.845 

Test set score: 0.801 

Confusion matrix:
Predicted    0    1  All
True                    
0          304   35  339
1           57   66  123
All        361  101  462

Classification report:
              precision    recall  f1-score   support

           0       0.84      0.90      0.87       339
           1       0.65      0.54      0.59       123

    accuracy                           0.80       462
   macro avg       0.75      0.72      0.73       462
weighted avg       0.79      0.80      0.79       462

Cross-validation scores: [0.86178862 0.78861789 0.80487805 0.75609756 0.83739837 0.8699187
 0.85365854 0.83739837 0.82926829 0.79674797 0.76422764 0.82113821
 0.79674797 0.82926829 0.7804878 ]
Average cross-validation score: 0.82


In [None]:
nb = GaussianNB().fit(train_feature, train_class)
print("Training set score: {:.3f}".format(nb.score(train_feature, train_class)), "\n")
print("Test set score: {:.3f}".format(nb.score(test_feature, test_class)), "\n")

prediction = nb.predict(test_feature)
print("Confusion matrix:")
print(pd.crosstab(test_class, prediction, rownames=['True'], colnames=['Predicted'], margins=True))
print()
print("Classification report:")
print(classification_report(test_class, prediction))

scores = cross_val_score(nb, x_data, y_target, cv=5)
print("Cross-validation scores: {}".format(scores))
print("Average cross-validation score: {:.2f}".format(scores.mean()))

Training set score: 0.765 

Test set score: 0.753 

Confusion matrix:
Predicted    0    1  All
True                    
0          268   71  339
1           43   80  123
All        311  151  462

Classification report:
              precision    recall  f1-score   support

           0       0.86      0.79      0.82       339
           1       0.53      0.65      0.58       123

    accuracy                           0.75       462
   macro avg       0.70      0.72      0.70       462
weighted avg       0.77      0.75      0.76       462

Cross-validation scores: [0.65582656 0.70731707 0.66395664 0.67750678 0.57723577]
Average cross-validation score: 0.66


In [None]:
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(train_feature, train_class)
print("Training set score: {:.3f}".format(tree.score(train_feature, train_class)), "\n")
print("Test set score: {:.3f}".format(tree.score(test_feature, test_class)), "\n")

prediction = tree.predict(test_feature)
print("Confusion matrix:")
print(pd.crosstab(test_class, prediction, rownames=['True'], colnames=['Predicted'], margins=True))
print()
print("Classification report:")
print(classification_report(test_class, prediction))

scores = cross_val_score(tree, x_data, y_target, cv=13)
print("Cross-validation scores: {}".format(scores))
print("Average cross-validation score: {:.2f}".format(scores.mean()))

Training set score: 0.819 

Test set score: 0.781 

Confusion matrix:
Predicted    0    1  All
True                    
0          296   43  339
1           58   65  123
All        354  108  462

Classification report:
              precision    recall  f1-score   support

           0       0.84      0.87      0.85       339
           1       0.60      0.53      0.56       123

    accuracy                           0.78       462
   macro avg       0.72      0.70      0.71       462
weighted avg       0.77      0.78      0.78       462

Cross-validation scores: [0.78169014 0.78873239 0.77464789 0.76056338 0.83098592 0.79577465
 0.8028169  0.80985915 0.77464789 0.79577465 0.78873239 0.75352113
 0.76595745]
Average cross-validation score: 0.79




## **Documentation and explanation of our models and results.**



In this data set, I used every column except `Yearly salary` as the features. `Yearly salary` is the target column that the models predict.

The 4 models used for evaluation are K nearest neighbours, Naive Bayes, Linear Support Vector Machine, and Decision Tree.

We are going to choose a model based on test set accuracy, confusion matrix, classification report, and cross-validation scores.

Test set accuracy: It is the percentage of correct predictions on the test set.

Confusion matrix: The important entries are the following, where 'positive' means class `1` (salary of \$100K or more) and 'negative' means class `0`:

- True Positive: where both the actual and the predicted condition are positive.
- False Positive: where the actual condition is negative, but the predicted condition is positive.
- False Negative: where the actual condition is positive, but the predicted condition is negative.
- True Negative: where both the actual and the predicted condition are negative.


`Classification Report`: We fetch the F1-Score, Precision, Recall and accuracy. These are calculated from the confusion matrix result.
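
As a worked example of how the report follows from the confusion matrix, the sketch below recomputes the class `1` metrics from the k-NN counts shown in the output above (317, 22, 85, 38), treating class `1` as the positive class:

```python
# Counts taken from the k-NN confusion matrix, with class 1 ('Yes') as positive.
tn, fp = 317, 22   # actual 0: predicted 0, predicted 1
fn, tp = 85, 38    # actual 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the instances predicted 1, the share that really are 1
recall = tp / (tp + fn)      # of the instances that really are 1, the share that was found
f1 = 2 * precision * recall / (precision + recall)

print("accuracy:  {:.2f}".format(accuracy))   # 0.77
print("precision: {:.2f}".format(precision))  # 0.63
print("recall:    {:.2f}".format(recall))     # 0.31
print("f1-score:  {:.2f}".format(f1))         # 0.42
```

These values agree with the class `1` row of the k-NN classification report.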
 
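`Cross Validation`: It repeatedly splits the data into training and validation folds and averages the scores. This gives a more reliable estimate of how the model performs on new data and helps to detect overfitting.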


`K Nearest neighbours`

k-NN is a classification method that assigns a data point to a class based on the classes of the training points closest to it.
Here, I have used k=4, so each instance is classified by a vote among its four nearest neighbours in the training set. 
From the confusion matrix we can see that there are 317 true negatives (class `0` predicted correctly) and 38 true positives (class `1` predicted correctly).
The test set accuracy is 77% (the classification report gives the same figure). The F1-score is 86% for class `0` but only 42% for class `1`, and the average cross validation score is 76%.
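
To make the voting idea concrete, here is a minimal from-scratch sketch of a k-NN prediction with Euclidean distance. The function name `knn_predict` and the toy points are made up for illustration; this is not how `KNeighborsClassifier` is implemented, and its handling of tied votes may differ.

```python
from collections import Counter

import numpy as np

def knn_predict(train_X, train_y, query, k=4):
    """Predict the class of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = np.linalg.norm(train_X - query, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Count the class labels of those neighbours and return the most common one.
    votes = Counter(train_y[nearest].tolist())
    return votes.most_common(1)[0][0]

# Two small, well-separated clusters (invented data).
train_X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
train_y = np.array([0, 0, 0, 1, 1, 1])

near_first = knn_predict(train_X, train_y, np.array([0.5, 0.5]))
near_second = knn_predict(train_X, train_y, np.array([5.5, 5.5]))
print(near_first, near_second)  # 0 1
```

With k=4, each query here receives three votes from its own cluster and one from the other, so the majority is clear.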


`Linear Support vector machine`

A linear SVM uses no kernel and finds the linear decision boundary with the largest margin between the two classes.
Here, I have increased `max_iter` to 5000 so that the solver has more iterations to converge, and I have used 15 folds for cross-validation.
The test set accuracy is 80.1% (80% in the classification report). The F1-score is 87% for class `0` and 59% for class `1`, and the average cross validation score is 82%.


`Naive Bayes`

The Naive Bayes classifier assumes that, within a class, each feature is independent of every other feature.
The test set accuracy is 75.3% (75% in the classification report). The F1-score is 82% for class `0` and 58% for class `1`, and the average cross validation score is 66%.


`Decision Tree`

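A decision tree splits the data repeatedly, each time on the value of a single feature. There are several criteria for choosing each split, such as Gini impurity and entropy. Here, the depth of the tree is limited to 4 (`max_depth=4`).
The test set accuracy is 78.1% (78% in the classification report). The F1-score is 85% for class `0` and 56% for class `1`, and the average cross validation score is 79%.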

From the above results, we can conclude that the `Linear SVM model` has the highest numbers among all the models with respect to accuracy, F1-score, and average cross validation score. Thus, I choose this model for this data set. Note that the cross-validation averages were computed with a different number of folds for each model (25, 15, 5, and 13), so they are not strictly comparable. An SVM can be used for both classification and regression; it finds an optimal boundary between the possible outputs, optionally after transforming the data with a kernel. SVMs work well when there is a clear margin of separation between the classes. They are effective in high dimensional spaces, even when the number of dimensions exceeds the number of samples. A kernel SVM is also memory efficient, because its decision function uses only a subset of the training points (called support vectors).
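
One way to put the four models on an equal footing is to score them all on the same cross-validation folds. The sketch below does this on synthetic data; the choice of 10 stratified, shuffled folds is an assumption for the example, and the model settings are copied from the cells above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the survey features and labels.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# A single splitter shared by all models, so every model sees identical folds.
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

models = {
    "kNN": KNeighborsClassifier(n_neighbors=4),
    "Linear SVM": LinearSVC(random_state=0, tol=1e-1, max_iter=5000),
    "Naive Bayes": GaussianNB(),
    "Decision tree": DecisionTreeClassifier(max_depth=4, random_state=0),
}

mean_scores = {
    name: cross_val_score(model, X, y, cv=folds).mean()
    for name, model in models.items()
}
for name, score in mean_scores.items():
    print("{}: {:.3f}".format(name, score))
```

With a shared splitter, differences between the averages reflect the models rather than the way the data happened to be partitioned.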














