# CS3300 - Final Project
### By: Neel Desai and Tyler Faulkner
### Due: February 21, 2021

## Hypothesis

Some of the questions we are aiming to answer with this notebook are:

Do any variables correlate with whether a student gets placed?

How well does a predictive classification model perform using correlated variables?

How does that model compare to models using other selections of variables?

If variables in the data set correlate with a student's placement after college, then a reasonably accurate classification model can be trained on those variables to predict placement.
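One way to make the first question concrete is to encode placement as 0/1 and compute the Pearson correlation of each numerical column with that encoding (for a binary variable this is the point-biserial correlation). The sketch below is only an illustration: the rows are made-up toy values, not the real data, and `placed` is a hypothetical helper column. The column names `ssc_p`, `mba_p`, and `status` mirror those in the data set.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT values from the real data set.
toy = pd.DataFrame({
    "ssc_p": [55.0, 62.0, 70.0, 81.0, 48.0, 77.0],
    "mba_p": [60.0, 58.0, 65.0, 63.0, 61.0, 59.0],
    "status": ["Not Placed", "Placed", "Placed", "Placed", "Not Placed", "Placed"],
})

# Encode the binary target as 0/1 (hypothetical helper column).
toy["placed"] = (toy["status"] == "Placed").astype(int)

# Pearson correlation of each numerical column with the 0/1 target,
# which is equivalent to the point-biserial correlation.
correlations = toy[["ssc_p", "mba_p"]].corrwith(toy["placed"])
print(correlations.sort_values(ascending=False))
```

A value near +1 or -1 would suggest a strong linear association with placement, while a value near 0 would suggest little or none. Categorical columns would need a different measure, which this sketch does not cover.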


## Data Set

The data set we will be analyzing is the College Recruitment data which can be found at the link below on Kaggle:

https://www.kaggle.com/benroshan/factors-affecting-campus-placement

The data set was created to analyze which academic and employability factors influence whether a student gets placed into a career. It is drawn from records of MBA students during the January 2022 term at CMS Business School in India. The original data can be found below:

https://github.com/DG1606/CMS-R-2020

The data set includes 15 columns in total. The serial number column (`sl_no`) will not be used since it only serves as a unique identifier for each entry. The `salary` column will also not be used since it is recorded only for students who were placed, which makes it a consequence of placement rather than a predictor of it.

Of the 13 remaining columns, 8 are categorical and 5 are numerical. One of the categorical columns, `status`, records whether the student was placed and is the target variable we will try to predict.


### Imports

In [17]:
import pandas as pd
import numpy as np

## Data Preprocessing

The data set contains 215 entries. The only column with missing values is `salary` (148 non-null values), since only students who are placed in a career have a salary. Because we are dropping that column, no imputation is needed.

In [9]:
datapath = "Placement_Data_Full_Class.csv"

# Load the raw placement data from the CSV file
raw_data = pd.read_csv(datapath)

# info() prints its summary directly and returns None, so it is not wrapped in print()
raw_data.info()

# Preview the first five rows
raw_data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 215 entries, 0 to 214
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   sl_no           215 non-null    int64  
 1   gender          215 non-null    object 
 2   ssc_p           215 non-null    float64
 3   ssc_b           215 non-null    object 
 4   hsc_p           215 non-null    float64
 5   hsc_b           215 non-null    object 
 6   hsc_s           215 non-null    object 
 7   degree_p        215 non-null    float64
 8   degree_t        215 non-null    object 
 9   workex          215 non-null    object 
 10  etest_p         215 non-null    float64
 11  specialisation  215 non-null    object 
 12  mba_p           215 non-null    float64
 13  status          215 non-null    object 
 14  salary          148 non-null    float64
dtypes: float64(6), int64(1), object(8)
memory usage: 25.3+ KB


| | sl_no | gender | ssc_p | ssc_b | hsc_p | hsc_b | hsc_s | degree_p | degree_t | workex | etest_p | specialisation | mba_p | status | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | M | 67.0 | Others | 91.0 | Others | Commerce | 58.0 | Sci&Tech | No | 55.0 | Mkt&HR | 58.8 | Placed | 270000.0 |
| 1 | 2 | M | 79.33 | Central | 78.33 | Others | Science | 77.48 | Sci&Tech | Yes | 86.5 | Mkt&Fin | 66.28 | Placed | 200000.0 |
| 2 | 3 | M | 65.0 | Central | 68.0 | Central | Arts | 64.0 | Comm&Mgmt | No | 75.0 | Mkt&Fin | 57.8 | Placed | 250000.0 |
| 3 | 4 | M | 56.0 | Central | 52.0 | Central | Science | 52.0 | Sci&Tech | No | 66.0 | Mkt&HR | 59.43 | Not Placed | NaN |
| 4 | 5 | M | 85.8 | Central | 73.6 | Central | Commerce | 73.3 | Comm&Mgmt | No | 96.8 | Mkt&Fin | 55.5 | Placed | 425000.0 |
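
The claim that salary is missing only for students who were not placed can be checked directly rather than assumed. A minimal sketch, using three columns hand-copied from the five preview rows above instead of the full CSV (`sample` and `missing_salary` are illustrative names, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Three columns of the five preview rows; the full data set has 215 rows.
sample = pd.DataFrame({
    "sl_no": [1, 2, 3, 4, 5],
    "status": ["Placed", "Placed", "Placed", "Not Placed", "Placed"],
    "salary": [270000.0, 200000.0, 250000.0, np.nan, 425000.0],
})

# Count missing values per column.
print(sample.isna().sum())

# Cross-tabulate missing salary against placement status.
missing_salary = sample["salary"].isna()
print(pd.crosstab(sample["status"], missing_salary))

# A salary should be missing exactly when the student was not placed.
assert (missing_salary == (sample["status"] == "Not Placed")).all()

# The serial number should be unique per row, which makes it an identifier only.
assert sample["sl_no"].is_unique
```

Running the same checks with `raw_data` in place of `sample` should report 67 missing salaries (215 - 148). If the assumption holds for the full data, those rows would be exactly the Not Placed students.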


We will drop the serial number and salary columns: salary only applies to placed students, and the serial number is simply a unique identifier that carries no information about the status.

In [12]:
clean_data = raw_data.drop(['sl_no', 'salary'], axis=1)

clean_data.head()

| | gender | ssc_p | ssc_b | hsc_p | hsc_b | hsc_s | degree_p | degree_t | workex | etest_p | specialisation | mba_p | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 67.0 | Others | 91.0 | Others | Commerce | 58.0 | Sci&Tech | No | 55.0 | Mkt&HR | 58.8 | Placed |
| 1 | M | 79.33 | Central | 78.33 | Others | Science | 77.48 | Sci&Tech | Yes | 86.5 | Mkt&Fin | 66.28 | Placed |
| 2 | M | 65.0 | Central | 68.0 | Central | Arts | 64.0 | Comm&Mgmt | No | 75.0 | Mkt&Fin | 57.8 | Placed |
| 3 | M | 56.0 | Central | 52.0 | Central | Science | 52.0 | Sci&Tech | No | 66.0 | Mkt&HR | 59.43 | Not Placed |
| 4 | M | 85.8 | Central | 73.6 | Central | Commerce | 73.3 | Comm&Mgmt | No | 96.8 | Mkt&Fin | 55.5 | Placed |


The last part of our preprocessing is to convert the object (string) columns to the pandas `category` dtype.

In [21]:
clean_data = clean_data.astype({'gender':'category',
                               'ssc_b':'category',
                               'hsc_b':'category',
                               'hsc_s':'category',
                               'degree_t':'category',
                               'workex':'category',
                               'specialisation':'category',
                               'status':'category'})
clean_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 215 entries, 0 to 214
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   gender          215 non-null    category
 1   ssc_p           215 non-null    float64 
 2   ssc_b           215 non-null    category
 3   hsc_p           215 non-null    float64 
 4   hsc_b           215 non-null    category
 5   hsc_s           215 non-null    category
 6   degree_p        215 non-null    float64 
 7   degree_t        215 non-null    category
 8   workex          215 non-null    category
 9   etest_p         215 non-null    float64 
 10  specialisation  215 non-null    category
 11  mba_p           215 non-null    float64 
 12  status          215 non-null    category
dtypes: category(8), float64(5)
memory usage: 11.2 KB
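
Listing each column by hand works for eight columns. An alternative is to select every non-numeric column and convert them all in one step. The sketch below runs on a small stand-in frame built from two of the preview rows (`demo` and `text_cols` are illustrative names); on the notebook's data the same two lines should apply to `clean_data`, assuming every non-numeric column is meant to be categorical.

```python
import pandas as pd

# Small stand-in frame built from two of the preview rows.
demo = pd.DataFrame({
    "gender": ["M", "M"],
    "ssc_p": [67.0, 79.33],
    "workex": ["No", "Yes"],
    "status": ["Placed", "Placed"],
})

# Select every non-numeric column and cast it to the category dtype.
text_cols = demo.select_dtypes(exclude="number").columns
demo = demo.astype({col: "category" for col in text_cols})

print(demo.dtypes)
```

This keeps the conversion in sync with the data if columns are added or renamed, at the cost of being less explicit about which columns are affected.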


## Data Analysis and Visualization

## Data Modeling and Prediction

## Results Analysis