#  **Sample:** 
The sample comprises N=215 students of one college. Each record holds the student's SSC and HSC marks, board (Central or other), gender, degree marks, stream, and placement status.


#  **Measures:** 
The placement status of each student was examined by gender, by SSC and HSC board, and by degree stream.


#  **Predictors:**
1. Whether the choice of board affects getting placed with a good package
2. Association between class 10 (SSC) and class 12 (HSC) marks
3. Role of degree percentage in getting a good package
4. Placement numbers of boys and girls
5. Insights about unplaced students

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# The 'seaborn' style name is deprecated since Matplotlib 3.6; seaborn's own theme gives a similar look
sns.set_theme()

In [None]:
data = pd.read_csv('../input/factors-affecting-campus-placement/Placement_Data_Full_Class.csv')
data.head()

In [None]:
data.info()

In [None]:
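# Salary is missing for some rows; keep the non-null values in their own Series.
# dropna() returns a new Series, whereas dropna(inplace=True) would return None.
salary = data.salary.dropna()

In [None]:
salary.head()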

In [None]:
sns.boxplot(data=data, x='gender', y='salary');

In [None]:
salary_record = data.salary.groupby(data.gender)
print(salary_record.mean())
print(salary_record.std())

In [None]:
# Count placed and unplaced students within each gender
gender_status_record = data.status.groupby(data.gender)
gender_status_record.value_counts()

In [None]:
sns.countplot(data=data, x='status', hue='gender');

* More boys than girls are placed.
* The ratio of boys to girls is about 2:1 among placed students and about 10:7 among unplaced students.
* The highest package was offered to a male student.
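
Raw counts depend on how many boys and girls are in the sample, so a per-gender placement rate can complement the ratios above. The sketch below uses a small made-up frame (`toy` and its values are hypothetical, not taken from the placement data) to show one way of getting both counts and row-normalized rates with `pd.crosstab`.

```python
import pandas as pd

# Hypothetical toy records; in the notebook the `data` frame would be used instead.
toy = pd.DataFrame({
    "gender": ["M", "M", "M", "M", "F", "F", "F", "F"],
    "status": ["Placed", "Placed", "Placed", "Not Placed",
               "Placed", "Placed", "Not Placed", "Not Placed"],
})

# Raw counts per gender and status (the numbers a countplot draws).
counts = pd.crosstab(toy["gender"], toy["status"])

# Row-normalized shares: the fraction of each gender that was placed.
rates = pd.crosstab(toy["gender"], toy["status"], normalize="index")

print(counts)
print(rates)
```

Swapping `toy` for `data` should give the same two tables for the real sample, assuming the `gender` and `status` columns are named as in the cells above.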

In [None]:
# Placement status counts within each degree stream and gender
dept_status_record = data.status.groupby([data.degree_t, data.gender])
dept_status_record.value_counts()

In [None]:
sns.countplot(data=data, x='status', hue='degree_t');

In [None]:
# Placement status counts by degree stream, gender, and HSC specialization
status_records = data.status.groupby([data.degree_t, data.gender, data.hsc_s])
status_records.value_counts()

* The Commerce and Management (`Comm&Mgmt`) stream has the most placements, followed by Science and Technology.
* Very few students from the other streams are placed.

In [None]:
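# Both variables are numeric, so a scatter plot shows the relationship more directly than a boxplot
sns.scatterplot(data=data, x='ssc_p', y='salary');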

In [None]:
sns.swarmplot(data=data, x='status', y='degree_p', hue='gender');

In [None]:
sns.swarmplot(data=data, x='status', y='hsc_p', hue='gender');

In [None]:
sns.swarmplot(data=data, x='status', y='ssc_p', hue='gender');

Students who scored below 60% in the SSC and HSC exams are mostly not placed.
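
The swarm plots suggest this pattern visually; a placement rate per score band puts a number on it. The following is a minimal sketch on made-up values (`toy`, the `band` labels, and every number in it are hypothetical), assuming a 60% cutoff as in the text.

```python
import pandas as pd

# Hypothetical SSC percentages and outcomes for ten students.
toy = pd.DataFrame({
    "ssc_p": [45.0, 52.0, 58.0, 59.5, 61.0, 67.0, 73.0, 80.0, 85.0, 90.0],
    "status": ["Not Placed", "Not Placed", "Not Placed", "Placed", "Not Placed",
               "Placed", "Placed", "Placed", "Placed", "Placed"],
})

# Label each student by score band, then flag placement as True/False.
toy["band"] = toy["ssc_p"].lt(60).map({True: "below 60", False: "60 or above"})
toy["placed"] = toy["status"].eq("Placed")

# The mean of a boolean column is the share of True values, i.e. the placement rate.
rate_by_band = toy.groupby("band")["placed"].mean()
print(rate_by_band)
```

The same three lines applied to `data` (and repeated with `hsc_p`) would report the rates for the real sample.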

In [None]:
sns.regplot(x='ssc_p', y='hsc_p', data=data);

* Students who performed well in SSC generally maintained that performance in the HSC exam.
* Among students who scored below 60 in SSC, some also scored low in HSC, while others improved their score in the HSC exam.
* The fitted regression line slopes upward, so higher SSC scores go with higher HSC scores.
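
The strength of this linear relationship can be summarized with a correlation coefficient and the slope of the least-squares line. The sketch below uses six made-up score pairs (`ssc` and `hsc` are hypothetical arrays, not the notebook's columns).

```python
import numpy as np

# Hypothetical SSC and HSC percentages for six students.
ssc = np.array([55.0, 60.0, 65.0, 70.0, 80.0, 90.0])
hsc = np.array([58.0, 57.0, 68.0, 72.0, 79.0, 88.0])

# Pearson correlation: +1 is a perfect increasing line, 0 means no linear relation.
r = np.corrcoef(ssc, hsc)[0, 1]

# Least-squares line hsc ~ slope * ssc + intercept.
slope, intercept = np.polyfit(ssc, hsc, deg=1)

print(f"r = {r:.3f}, slope = {slope:.3f}, intercept = {intercept:.3f}")
```

On the real data, `data.ssc_p.corr(data.hsc_p)` should give the same kind of coefficient in one call.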

# Regression Modeling

* Binary Response Variable: ‘status’
* Explanatory Variables: 'ssc_p', 'hsc_p', 'degree_p', 'etest_p', 'mba_p'

In [None]:
import statsmodels.api
import statsmodels.formula.api as smf

In [None]:
data_copy = data.copy()

In [None]:
# Center each score column on its mean
score_cols = ['ssc_p', 'hsc_p', 'degree_p', 'etest_p', 'mba_p']
data_copy[score_cols] = data_copy[score_cols] - data_copy[score_cols].mean()

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# LabelEncoder sorts classes alphabetically: 'Not Placed' -> 0, 'Placed' -> 1
data_copy.status = le.fit_transform(data.status)
data_copy.status.value_counts()

In [None]:
sns.regplot(data=data_copy, x='ssc_p', y='status', label='SSC')
sns.regplot(data=data_copy, x='hsc_p', y='status', label='HSC')
sns.regplot(data=data_copy, x='degree_p', y='status', label='DEGREE')
sns.regplot(data=data_copy, x='mba_p', y='status', label='MBA')
plt.legend();

In [None]:
reg1 = smf.ols('status ~ ssc_p', data=data_copy).fit()
reg2 = smf.ols('status ~ hsc_p', data=data_copy).fit()
print(reg1.summary())
print(reg2.summary())

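The p-values are reported as 0.000 (that is, below 0.001), which indicates a statistically significant positive association of both ssc_p and hsc_p with placement status.
The fitted regression line is straight, but a straight line is not a good fit for a response that takes only the values 0 and 1.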
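
One way to see why a straight line fits a 0/1 response poorly is that its predictions leave the [0, 1] range, while a logistic curve cannot. The sketch below uses synthetic centered scores; the arrays and the steepness value 0.2 are illustrative assumptions, not estimates from the placement data.

```python
import numpy as np

# Hypothetical centered scores and a 0/1 outcome that switches at the mean.
x = np.linspace(-30, 30, 60)
y = (x > 0).astype(float)

# Ordinary least-squares line through the 0/1 outcomes.
slope, intercept = np.polyfit(x, y, deg=1)
linear_pred = slope * x + intercept

# A logistic curve with a hand-picked steepness of 0.2 (not fitted).
logistic_pred = 1.0 / (1.0 + np.exp(-0.2 * x))

print("linear range:  ", linear_pred.min(), linear_pred.max())
print("logistic range:", logistic_pred.min(), logistic_pred.max())
```

For an actual fit, `smf.logit` from `statsmodels.formula.api` takes the same formula syntax as `smf.ols` and may be a more natural model for `status`.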

# Binary Tree

1. Decision tree analysis was performed to test nonlinear relationships between a series of explanatory variables and a binary, categorical response variable.
2. All possible separations (categorical) or cut points (quantitative) are tested.
3. For this analysis, ‘status’ is the response variable used to grow the tree.
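
The cut-point search in step 2 can be sketched for a single quantitative variable as follows. This is a simplified illustration, not scikit-learn's implementation: `gini` and `best_cut` are hypothetical helpers, the data are made up, and Gini impurity is assumed as the split criterion (it is the default for `DecisionTreeClassifier`).

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 0/1 label array (0 means the node is pure)."""
    if len(labels) == 0:
        return 0.0
    p = np.mean(labels)
    return 2.0 * p * (1.0 - p)

def best_cut(values, labels):
    """Try every midpoint between sorted distinct values; return (cut, impurity)."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    distinct = np.unique(values)
    candidates = (distinct[:-1] + distinct[1:]) / 2.0
    best = (None, np.inf)
    for cut in candidates:
        left, right = labels[values <= cut], labels[values > cut]
        # Impurity of the two child nodes, weighted by their sizes.
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (cut, weighted)
    return best

# Hypothetical scores and outcomes (1 = placed).
ssc_p = [50, 55, 62, 70, 80]
placed = [0, 0, 1, 1, 1]
cut, impurity = best_cut(ssc_p, placed)
print(cut, impurity)
```

A full tree repeats this search over every variable at every node and keeps the split with the lowest weighted impurity.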

In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import sklearn.metrics
from sklearn import tree

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier

from sklearn.linear_model import LassoLarsCV
from sklearn.metrics import mean_squared_error
from sklearn import preprocessing

In [None]:
le = LabelEncoder()
data_copy.status = le.fit_transform(data.status)
data_copy.status.value_counts()

In [None]:
features = data_copy[['ssc_p', 'hsc_p', 'degree_p', 'etest_p', 'mba_p']]
targets = data_copy.status

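# train_size=0.4 keeps 40% of the rows for training and holds out 60%; random_state makes the split repeatable
ftrain, ftest, ttrain, ttest = train_test_split(features, targets, train_size=0.4, random_state=123)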
print(ftrain.shape)
print(ftest.shape)
print(ttrain.shape)
print(ttest.shape)

In [None]:
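classifier = DecisionTreeClassifier()
# Fit on the training split only, so the test rows stay unseen until scoring
classifier = classifier.fit(ftrain, ttrain)
prediction = classifier.predict(ftest)

sklearn.metrics.accuracy_score(ttest, prediction)

In [None]:
ax = sns.distplot(ttest, kde=False, color='r', hist=True)
sns.distplot(prediction, kde=False, ax=ax, color='g', hist=True);

A tree fit on all rows, test rows included, scores 100% on the test set and the two histograms overlap exactly, but that figure only shows that the tree has memorized the data. With the fit restricted to the training split, the accuracy above is a held-out estimate, and the red (actual) and green (predicted) histograms differ wherever the predicted class counts deviate from the actual ones.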
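
A perfect score is worth checking against the question of whether the test rows were also used for fitting. The sketch below, on synthetic data with hypothetical names (`X`, `y`, `leaky`, `honest`), compares a tree fit on every row with one fit on the training split only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Hypothetical data: 200 rows, 5 scores, label loosely tied to the first score.
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=1.0, size=200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.4, random_state=0
)

# "leaky" sees the test rows during fitting; "honest" does not.
leaky = DecisionTreeClassifier(random_state=0).fit(X, y)
honest = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

leaky_acc = accuracy_score(y_test, leaky.predict(X_test))
honest_acc = accuracy_score(y_test, honest.predict(X_test))
print(leaky_acc, honest_acc)
```

An unpruned tree can reproduce the label of every row it was fit on, so the first number says nothing about unseen students; the second is the one to report.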

# Random Forest

Random forests provide an importance score for each explanatory variable and make it possible to see how classification accuracy changes as the number of trees in the forest grows.

In [None]:
classifier = RandomForestClassifier(n_estimators=25)
# Fit on the training split only; the test rows are reserved for scoring
classifier = classifier.fit(ftrain, ttrain)
prediction = classifier.predict(ftest)

sklearn.metrics.accuracy_score(ttest, prediction)

In [None]:
extra = ExtraTreesClassifier()
extra.fit(ftrain, ttrain)

extra.feature_importances_

In [None]:
trees = range(50)
accuracy = np.zeros(50)

# Refit with 1 to 50 trees and record the held-out accuracy for each forest size
for item in trees:
    classifier = RandomForestClassifier(n_estimators = item+1)
    classifier = classifier.fit(ftrain, ttrain)
    prediction = classifier.predict(ftest)
    accuracy[item] = sklearn.metrics.accuracy_score(ttest, prediction)

accuracy

In [None]:
# accuracy[i] belongs to a forest of i+1 trees, so the x axis runs from 1 to 50
plt.plot(np.arange(1, 51), accuracy)
plt.ylabel('Accuracy Score')
plt.xlabel('Number of Trees');

* A forest fit on all rows, test rows included, scores close to 100% (0.978 to 1 across the tree counts), which reflects memorization of the test rows. Fit on the training split only, as above, the plotted accuracy is a held-out estimate and is expected to be lower.
* SSC percentage has the highest importance score, about 0.31, so it is the variable most strongly associated with placement status.
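
Importance scores from scikit-learn's tree ensembles are normalized to sum to 1, so a value near 0.31 reads as roughly 31% of the total impurity reduction. The sketch below illustrates this on synthetic data in which only the first column carries signal; all names and values are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical data: three columns, but only column 0 determines the label.
X = rng.normal(size=(300, 3))
y = (X[:, 0] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)
importances = forest.feature_importances_

# Pair each column index with its score, highest first.
ranked = sorted(enumerate(importances), key=lambda pair: pair[1], reverse=True)
print(ranked)
```

In the notebook, zipping `features.columns` with `extra.feature_importances_` would label each score with its variable name.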

# **Lasso Regression**

* Lasso regression is often used in machine learning to select a subset of variables.
* The LASSO imposes a constraint on the sum of the absolute values of the model parameters, where the sum has a specified constant as an upper bound.
* This constraint shrinks the regression coefficients toward zero and sets some of them exactly to zero, so the model retains only the most important predictors.
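
Written out, the constraint described above takes the following standard form. The symbols are notation assumed here for illustration: $y_i$ is the response for student $i$, $x_{ij}$ the value of predictor $j$, $\beta_j$ its coefficient, $p$ the number of predictors, and $t$ the upper bound on the sum of absolute values.

$$
\hat{\beta} = \underset{\beta_0,\,\beta}{\arg\min} \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^2
\quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t
$$

A smaller $t$ tightens the bound and pushes more coefficients to exactly zero; a large enough $t$ leaves the ordinary least-squares solution unchanged.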

In [None]:
features_data = features.copy()

# Standardize each predictor to mean 0 and standard deviation 1
for col in features_data.columns:
    features_data[col] = preprocessing.scale(features_data[col].astype('float64'))

In [None]:
ftrain, ftest, ttrain, ttest = train_test_split(features_data, targets, train_size=0.4, random_state=123)
print(ftrain.shape)
print(ftest.shape)
print(ttrain.shape)
print(ttest.shape)

In [None]:
model = LassoLarsCV(cv=10, precompute=False).fit(ftrain, ttrain)
dict(zip(features_data.columns, model.coef_))

In [None]:
# Plot each coefficient's path against -log10(alpha); the dashed line marks the alpha chosen by cross-validation
m_log_alphas = -np.log10(model.alphas_)
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log10(alpha)')
plt.legend()
plt.title('Regression Coefficients Progression for Lasso Paths');

* Of the 5 predictor variables, 3 were retained in the selected model.
* SSC percentage was most strongly associated with placement status, followed by HSC percentage.
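
The retained predictors and their ranking can be read off the coefficient vector directly. The sketch below uses a made-up vector in place of `model.coef_`: the numbers are hypothetical, and the choice of which third variable is non-zero is arbitrary, purely to show the mechanics.

```python
import numpy as np

columns = ['ssc_p', 'hsc_p', 'degree_p', 'etest_p', 'mba_p']

# Hypothetical stand-in for model.coef_; these values are NOT results from the notebook.
coef = np.array([0.21, 0.09, 0.05, 0.0, 0.0])

# Lasso drops a predictor by setting its coefficient exactly to zero.
retained = [(name, c) for name, c in zip(columns, coef) if c != 0]

# Rank the retained predictors by absolute coefficient size, largest first.
ranked = sorted(retained, key=lambda pair: abs(pair[1]), reverse=True)
print(len(retained), ranked)
```

Because the predictors were standardized before fitting, comparing absolute coefficient sizes across variables is meaningful.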