## What is boosting?
Boosting refers to a family of algorithms that convert weak learners into strong learners. The main principle is to fit a sequence of weak learners (models that are only slightly better than random guessing, such as small decision trees) to weighted versions of the data, where more weight is given to examples that were misclassified in earlier rounds. The individual predictions are then combined through a weighted majority vote (classification) or a weighted sum (regression) to produce the final prediction. The principal difference between boosting and committee methods such as bagging is that the base learners are trained in sequence, each on a reweighted version of the data, rather than independently.
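
To make the reweighting idea concrete, here is a minimal sketch of discrete AdaBoost, one member of the boosting family. It is an illustration under stated assumptions, not part of the placement analysis: the data is synthetic (`make_classification`), the weak learners are depth-1 decision trees, and names such as `X_demo`, `y_pm` and `n_rounds` are made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data (an assumption for illustration, not the placement dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
y_pm = np.where(y_demo == 1, 1, -1)  # discrete AdaBoost uses labels in {-1, +1}

n_rounds = 10
weights = np.full(len(y_pm), 1 / len(y_pm))  # start with uniform example weights
stumps, alphas = [], []

for _ in range(n_rounds):
    # Weak learner: a decision stump fitted to the weighted data
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X_demo, y_pm, sample_weight=weights)
    pred = stump.predict(X_demo)

    # Weighted error of this round, clipped to keep the log finite
    err = np.clip(np.sum(weights[pred != y_pm]), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # vote strength of this stump

    # Increase the weight of misclassified examples, then renormalise
    weights = weights * np.exp(-alpha * y_pm * pred)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the weighted vote over all stumps
ensemble_score = sum(a * s.predict(X_demo) for a, s in zip(alphas, stumps))
ensemble_pred = np.sign(ensemble_score)

first_stump_acc = np.mean(stumps[0].predict(X_demo) == y_pm)
ensemble_acc = np.mean(ensemble_pred == y_pm)
print(f"first stump training accuracy: {first_stump_acc:.3f}")
print(f"boosted training accuracy:     {ensemble_acc:.3f}")
```

In practice `sklearn.ensemble.AdaBoostClassifier` or a gradient boosting implementation would be used instead of a hand-written loop; the loop is only meant to show where the example weights and the vote weights come from.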

- Import Libraries
- Import Dataset
- Data Exploration
- Data Visualization
    - Correlation between features
    - Lineplot
    - Barplot: No. of students from different boards
    - Catplot: Percentage by board and gender
    - Boxplot: Finding salary outliers
    - Pie chart: Most preferred stream
- Data Distribution
- Encoding categorical data
- Build a regression model (salary predictor)
    - Preprocessing data
    - Estimation by multiple linear regression
    - OLS model summary


## Import Libraries

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns


from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder


## Load Dataset

In [None]:
placement = pd.read_csv('../input/factors-affecting-campus-placement/Placement_Data_Full_Class.csv')
placement.head()

In [None]:
placement.shape

## Data Exploration

In [None]:
from pandas_profiling import ProfileReport
profile = ProfileReport(placement)

In [None]:
profile.to_widgets()

Average percentage required to get placed



In [None]:
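# Drop the serial number and board columns, then compare the mean scores by placement status
df = placement.drop(columns=["sl_no", "ssc_b", "hsc_b"])
# numeric_only=True (pandas 1.5+) skips the string columns, which have no mean
df_new = df.groupby(by="status").mean(numeric_only=True)
df_new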

## Data visualisation

In [None]:
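# Correlation is only defined for numeric columns (numeric_only requires pandas 1.5+)
matrix = placement.corr(numeric_only=True)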
plt.figure(figsize=(9,7))

#plot heat map
g=sns.heatmap(matrix,annot=True)

Which marks matter most for salary?

In [None]:
plt.figure(figsize=(12,8))
plt.ylim([200000,450000])
# Pass label= to each regplot so every legend entry is tied to the right series
sns.regplot(x="ssc_p",y="salary",data=placement,label="ssc percentage")
sns.regplot(x="hsc_p",y="salary",data=placement,label="hsc percentage")
sns.regplot(x="mba_p",y="salary",data=placement,label="MBA")
sns.regplot(x="etest_p",y="salary",data=placement,label="E-test")
plt.legend()
plt.ylabel("Salary")
plt.xlabel("Percentage %")
plt.show()

Number of students by higher secondary stream and by board (central vs. others), split by gender

In [None]:
# Draw one count plot per categorical column
for col in ["hsc_s","ssc_b","hsc_b"]:
    sns.countplot(x=col,hue="gender",data=placement)
    plt.ylabel("No. of students")
    plt.xlabel(col)
    plt.show()

Secondary school percentage (`ssc_p`) by board and gender



In [None]:
sns.catplot(x="ssc_b",y="ssc_p", hue='gender', data=placement, kind='boxen')
plt.ylabel("percentage")
plt.xlabel("boards")

**Observation:** In both boards, girls score a higher percentage on average than boys.


Work experience of students in different degrees

In [None]:
# Work experience on the x-axis, split by degree type
sns.catplot(x="workex",hue="degree_t",data=placement, kind="count")
plt.ylabel("No. of students")
plt.xlabel("work experience")
# Degree type on the x-axis, split by work experience
sns.catplot(x="degree_t",hue="workex",data=placement, kind="count")
plt.ylabel("No. of students")
plt.xlabel("degree type")

Finding the salary outliers

In [None]:
sns.catplot(y="salary",x="gender",data=placement, kind="box", hue="specialisation" );

Which stream do students prefer the most?

In [None]:
placement['degree_t'].value_counts(normalize=True).plot.pie(autopct='%1.1f%%')

Students that got placement

In [None]:
placement['status'].value_counts().plot.pie(autopct='%1.1f%%')

## Data Distribution

In [None]:
placement.hist(bins = 20, figsize=(10,10), color= 'green');

In [None]:
import plotly.express as px
dfc=pd.DataFrame(placement.groupby(['gender','specialisation','status'])['sl_no'].count()).rename(columns={'sl_no': 'no. of students'}).reset_index()

fig = px.sunburst(dfc, path=['gender','status','specialisation'], values='no. of students')
fig.update_layout(title="No. of students by gender, placement status and MBA specialisation",title_x=0.5)
fig.show()


## Encoding the categorical data

In [None]:
# Collect the string (object dtype) columns of df that need encoding
category = [col for col in df.columns if df[col].dtype == 'O']

In [None]:
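# Replace each string column with integer codes; assigning whole columns (rather than
# through .loc) lets pandas store them with an integer dtype instead of object
df[category] = df[category].apply(LabelEncoder().fit_transform)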

df.head()
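
As a small illustration of what the encoding step does, the sketch below applies `LabelEncoder` to a toy column and shows the resulting mapping. The values in `stream` are made up for the example. The integer codes follow the alphabetical order of the classes, so they carry an ordering that has no real meaning; for a linear model, one-hot columns (for example via `pd.get_dummies`) are a common alternative, shown here only for comparison.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy categorical column (hypothetical values, for illustration only)
stream = pd.Series(["Sci&Tech", "Comm&Mgmt", "Others", "Comm&Mgmt"], name="degree_t")

# LabelEncoder sorts the distinct values and assigns 0, 1, 2, ...
encoder = LabelEncoder()
codes = encoder.fit_transform(stream)

print(list(encoder.classes_))  # class order defines the codes
print(codes.tolist())          # integer code for each row

# The mapping is reversible
restored = encoder.inverse_transform(codes)

# Alternative: one indicator column per category, with no implied ordering
dummies = pd.get_dummies(stream)
print(dummies.shape)
```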

## Handle missing values

In [None]:
placement.isnull().sum()

In [None]:
placement.fillna(0, inplace=True)

## Preprocessing data

In [None]:
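df_reg = df.copy()
# Rows with a missing salary belong to students who were not placed; drop them
df_reg.dropna(inplace=True)
# Remove high-salary outliers to reduce the right skew of the target
df_reg = df_reg[df_reg["salary"]<350000.0]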

In [None]:
df_reg.info()

Skewness of the salary distribution

In [None]:
#PDF of Salary
sns.kdeplot(df["salary"])
plt.legend(["before"])
plt.show()

The density plot is right-skewed.

In [None]:
sns.kdeplot(df_reg["salary"])
plt.legend(["after"])
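
The visual impression of skew can also be checked numerically. The sketch below uses a handful of made-up salary values (not taken from the dataset) and pandas' `Series.skew()`, applying the same 350000 cut-off as the preprocessing cell; a positive value indicates a right-skewed distribution.

```python
import pandas as pd

# Hypothetical salaries with a long right tail (illustrative values only)
toy_salary = pd.Series(
    [200000, 220000, 240000, 250000, 260000, 270000, 300000, 400000, 650000, 940000]
)

skew_before = toy_salary.skew()

# Same cut-off as in the preprocessing step above
toy_trimmed = toy_salary[toy_salary < 350000.0]
skew_after = toy_trimmed.skew()

print(f"skewness before trimming: {skew_before:.2f}")
print(f"skewness after trimming:  {skew_after:.2f}")
```

On the real data, the same check would be `df["salary"].skew()` against `df_reg["salary"].skew()`.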

In [None]:
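# Select the features of the regression model: every column except the last two
# ("status" and "salary"); the target is the last column, "salary"
X = df_reg.iloc[:,:-2].values
y = df_reg.iloc[:,-1].values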

#splitting into training and test set
#from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=0)

## OLS model summary

In [None]:
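import statsmodels.api as sm
# statsmodels' OLS does not add an intercept on its own; add a constant column so the
# fit is comparable to LinearRegression, which fits an intercept by default
summ = sm.OLS(y_train, sm.add_constant(X_train)).fit()
summ.summary()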

## Estimation by multiple linear regression

In [None]:
#from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

#Predict the salary
y_pred_m = regressor.predict(X_test)

In [None]:
from sklearn.metrics import r2_score, mean_absolute_error
# R2: share of variance explained on the test set; MAE: average absolute error in salary units
print(f"R2 score {r2_score(y_test, y_pred_m)}")
MAE = mean_absolute_error(y_test, y_pred_m)
print(f"MAE {MAE}")
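
For reference, here is a small worked example of how the two metrics are computed, using made-up numbers rather than the model's predictions. `y_true_demo` and `y_hat_demo` are illustrative names. MAE is the mean of the absolute errors, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares around the mean.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error

# Toy targets and predictions (illustrative values only)
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_hat_demo = np.array([2.5, 5.0, 8.0, 9.5])

# MAE: average absolute difference between prediction and truth
mae_manual = np.mean(np.abs(y_true_demo - y_hat_demo))

# R2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true_demo - y_hat_demo) ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MAE (manual)  {mae_manual}")
print(f"MAE (sklearn) {mean_absolute_error(y_true_demo, y_hat_demo)}")
print(f"R2 (manual)   {r2_manual}")
print(f"R2 (sklearn)  {r2_score(y_true_demo, y_hat_demo)}")
```

Because MAE is expressed in the units of the target, it can be read directly as an average salary error, whereas R2 is unitless.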