<a href="https://www.bigdatauniversity.com"><img src = "https://ibm.box.com/shared/static/ugcqz6ohbvff804xp84y4kqnvvk3bq1g.png" width = 300, align = "center"></a>

# <center>Regression Method on CEO Salary Data </center>

# Table of Contents


<div class="alert alert-block alert-info" style="margin-top: 20px">
<li><a href="#ref0">Data Formatting and Preprocessing    </a></li>
<li><a href="#ref1"> Data Analysis</a></li>
<li><a href="#ref3">  Data Exploration </a></li>

<li><a href="#ref4"> Creating a Model </a></li>



</div>

 In this lab, we will predict CEO pay using features such as the age of the CEO, company profits, and sales. The dataset comes from the Wisconsin School of Business datasets accompanying *Regression Modeling with Actuarial and Financial Applications*.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

 Let's load the dataset using Pandas:

In [None]:
file_name='https://ibm.box.com/shared/static/worxgoibhvr53yccwmwx71iuly491h6u.csv'
df=pd.read_csv(file_name)
df.head()

## <a id="ref0"></a> Data Formatting and Preprocessing  


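 Let us re-scale **COMP** and **PROF**. The multipliers below suggest that COMP is recorded in thousands and PROF in millions, so this converts both to base units: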

In [None]:
df["COMP"]=1000*df["COMP"]
df["PROF"]=1000000*df["PROF"]

In [None]:
df.head()

Let's see the number of samples and features:

In [None]:
df.shape

We have 100 samples, 11 features and one target. 

 Let's look at the different data types:

In [None]:
df.dtypes

The columns **COMPANY** and **BIRTH** are not numerical. 
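
Rather than reading the non-numerical columns off the `dtypes` output by eye, they can be listed programmatically. The sketch below uses a small made-up frame (`toy` and its values are illustrative stand-ins, not the CEO data) and the pandas method `select_dtypes`:

```python
import pandas as pd

# Toy frame standing in for the CEO data (values are made up for illustration)
toy = pd.DataFrame({
    "COMPANY": ["A Corp", "B Inc", "C Ltd"],
    "BIRTH": ["NY", "OH", "NY"],
    "AGE": [52, 61, 47],
    "COMP": [1200.0, 850.5, 990.0],
})

# Keep only the columns whose dtype is not numeric
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

On the real dataframe the same call would be `df.select_dtypes(exclude="number")`.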

 Let’s see if we have missing data:

In [None]:
df.isnull().values.sum()

We have one missing data point, so we can simply drop that row using the method **dropna**.
Remember that unless you set the parameter **inplace** to **True** (or assign the result back to **df**), the dataframe will not change.


In [None]:
df.dropna(axis=0, inplace=True)
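
The effect of **inplace** can be checked on a small example. This is a minimal sketch on a made-up frame (`toy` is a hypothetical stand-in for `df`):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value (illustrative data only)
toy = pd.DataFrame({"AGE": [52, np.nan, 47], "COMP": [1200.0, 850.5, 990.0]})

# Without inplace=True, dropna returns a new frame and leaves `toy` untouched
cleaned = toy.dropna(axis=0)
rows_before = len(toy)       # the original still has all its rows
rows_cleaned = len(cleaned)  # the returned copy has the incomplete row removed

# Assigning the result back is an equivalent alternative to inplace=True
toy = toy.dropna(axis=0)
rows_after = len(toy)

print(rows_before, rows_cleaned, rows_after)
```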

We can verify that there is no missing data:  


In [None]:
df.isnull().values.sum()

## <a id="ref1"></a> Data Analysis

Let us get the summary statistics:

In [None]:
df.describe()

 Let’s see the most common birthplaces of the CEOs using the method **value_counts()**. Note that **[0:10].plot(kind='bar')** takes only the top ten values and plots them as a bar graph.

In [None]:
df['BIRTH'].value_counts(normalize=True)[0:10].plot(kind='bar')

We see 6% of the CEOs are from New York.
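
To see what `normalize=True` does, here is a small sketch on a made-up series of birthplaces (the values are illustrative, not taken from the dataset). The counts are divided by the number of entries, so the result is a share per category, sorted from most to least frequent:

```python
import pandas as pd

# Illustrative birthplaces: 8 entries, 3 of them "NY"
births = pd.Series(["NY", "OH", "NY", "TX", "NY", "OH", "CA", "IL"])

# Relative frequencies, sorted in descending order
shares = births.value_counts(normalize=True)

# Equivalent to slicing with [0:2]: keep the two most frequent categories
top_two = shares.head(2)
print(top_two)
```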


 In this lab we will not use categorical features, so we will drop those columns using the method **drop**. Let’s first look at the dataframe before we drop the columns.


In [None]:
df.head()

We drop the columns and view the result:

In [None]:
df=df.drop(labels=["COMPANY","BIRTH"], axis=1)
df.head()

## <a id='ref3'></a> Data Exploration 

 Let’s examine the distribution of the data:  


In [None]:
df.hist(bins=10, figsize=(20,15))
plt.show()

We can view the correlation between the variables:

In [None]:
corr=df.corr()
corr

Let us view the correlation with a heat map:

In [None]:
fig, ax = plt.subplots()
fig.set_size_inches(11.7, 8.27)
sns.heatmap(corr, mask=np.zeros_like(corr, dtype=bool), cmap=sns.diverging_palette(220, 10, as_cmap=True),square=True)
plt.show()
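
The `mask` argument hides the cells where the mask is `True`. The mask in the cell above is all `False`, so nothing is hidden. Because a correlation matrix is symmetric, one common option is to hide the upper triangle. The sketch below only builds such a mask, on a made-up correlation matrix (`corr_toy` is illustrative, not the CEO data):

```python
import numpy as np

# Made-up data: 50 samples of 4 variables, used only to get a 4 x 4 correlation matrix
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 4))
corr_toy = np.corrcoef(data, rowvar=False)

# True marks the cells to hide: the upper triangle, including the diagonal
mask = np.triu(np.ones_like(corr_toy, dtype=bool))
print(mask)

# The mask could then be passed on as sns.heatmap(corr_toy, mask=mask, ...)
```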

## <a id='ref4'></a> Creating a Model 


 We will need the following modules to build and test the model: 

In [None]:
from sklearn.model_selection import train_test_split
from sklearn import linear_model

 Let’s create a linear regression object:  


In [None]:
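# The normalize argument was removed in scikit-learn 1.2; ordinary least squares
# predictions do not depend on feature scaling, so it is simply omitted here.
lr = linear_model.LinearRegression()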

We extract the target data:

In [None]:
y=df[["COMP"]]

In [None]:
X=df.drop(labels="COMP", axis=1)

We split the data into a training and testing set: 

In [None]:
X_train, X_test, y_train,y_test = train_test_split(X, y, test_size=0.33,random_state=0)
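
The role of `test_size` and `random_state` can be illustrated on synthetic data. In this sketch `X_toy` and `y_toy` are made-up arrays. Roughly a third of the rows go to the test set, and repeating the call with the same `random_state` reproduces the same split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, with the target a simple function of the feature
X_toy = np.arange(100).reshape(-1, 1)
y_toy = 3 * X_toy.ravel()

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.33, random_state=0)

# Same random_state, same shuffle: the second split matches the first
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy, test_size=0.33, random_state=0)

print(len(X_tr), len(X_te))
```

Features and targets are shuffled together, so each row of `X_te` still lines up with its entry in `y_te`.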

We can fit the model and calculate the R^2:

In [None]:
lr.fit(X_train[["PROF"]],y_train)
lr.score(X_test[["PROF"]], y_test)
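
For a regressor, `score` returns the coefficient of determination, R^2 = 1 - SS_res / SS_tot. The sketch below computes it by hand on made-up numbers. The helper name `r_squared` is hypothetical and not part of scikit-learn:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y_true, y_true)                    # predictions match exactly
mean_only = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])    # always predict the mean
reversed_ = r_squared(y_true, [4.0, 3.0, 2.0, 1.0])    # worse than predicting the mean
print(perfect, mean_only, reversed_)
```

On held-out data the value can be negative, which means the model does worse than always predicting the mean of the test targets.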

We can make a prediction:

In [None]:
yhat=lr.predict(X_test[['PROF']])
yhat[0:5]

In [None]:
sns.regplot(x='PROF', y="COMP", data=df)

 We can find the intercept and the coefficient: 

In [None]:
lr.intercept_

In [None]:
lr.coef_
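
The intercept and coefficient fully describe a simple linear model: the prediction is the intercept plus the coefficient times the feature. A minimal sketch on synthetic data generated from y = 1 + 2x (the data and the name `model` are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data lying exactly on the line y = 1 + 2x
x = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 1.0 + 2.0 * x.ravel()

model = LinearRegression().fit(x, y)

# Rebuild the predictions by hand from the fitted parameters
manual = model.intercept_ + model.coef_[0] * x.ravel()
print(model.intercept_, model.coef_[0])
```

In the notebook the target is a one-column dataframe, so `lr.intercept_` and `lr.coef_` come back as arrays rather than scalars.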

We can calculate the R^2 for every feature:   

In [None]:
Feature_Name=[]
Feature_Rsq=[]

for name in list(X):
    # Fit a one-feature model and record its test-set R^2
    lr.fit(X_train[[name]], y_train)
    Feature_Name.append(name)
    Feature_Rsq.append(lr.score(X_test[[name]], y_test))
    plt.figure()
    sns.regplot(x=name, y="COMP", data=df)

The following plot shows the absolute correlation with **COMP** and the test-set R^2 for each feature. Features with a larger correlation tend to have a larger R^2.


In [None]:
x_ax=np.arange(len(list(X)))
x_ax

plt.figure(figsize=(20,10))

plt.bar(x_ax, np.abs(corr.loc['COMP', list(X)].values), align='center', alpha=0.5,label='absolute value of correlation ')
plt.bar(x_ax, Feature_Rsq, align='center',label='R^2')
plt.xticks(x_ax, list(X))

plt.legend()

plt.title('Correlation vs R^2 of individual variables')
plt.show()
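
The link between correlation and R^2 is exact in one special case: for a one-feature least-squares fit with an intercept, the R^2 measured on the training data equals the squared Pearson correlation between the feature and the target. The sketch below checks this on synthetic data (all names and numbers are illustrative):

```python
import numpy as np

# Synthetic feature and noisy linear target
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)

# Least-squares line through the data
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

# R^2 on the same data the line was fitted to
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_train = 1.0 - ss_res / ss_tot

# Pearson correlation between feature and target
r = np.corrcoef(x, y)[0, 1]
print(r2_train, r ** 2)
```

The plot above compares a correlation computed on the full dataframe with R^2 measured on the test set, so there the two quantities track each other only approximately.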

 We can use multiple linear regression as well. This leads to a higher R^2.

## Multiple Linear Regression

 Fitting the model on all features at once gives an R^2 of approximately 0.155 on the test set.

In [None]:
lr.fit(X_train, y_train)
lr.score(X_test, y_test)
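
One caveat when comparing models with different numbers of features: on the training data, adding features to a least-squares fit can never lower R^2, even if the extra features are pure noise. A minimal sketch on synthetic data (the helper `train_r2` and all values are hypothetical):

```python
import numpy as np

# Synthetic data: only the first of three features carries signal
rng = np.random.default_rng(0)
n = 60
X_syn = rng.normal(size=(n, 3))
y_syn = 1.5 * X_syn[:, 0] + rng.normal(size=n)

def train_r2(features, target):
    """Fit least squares with an intercept and return R^2 on the same data."""
    A = np.column_stack([np.ones(len(target)), features])
    beta, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ beta
    return 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)

r2_one = train_r2(X_syn[:, :1], y_syn)  # first feature only
r2_all = train_r2(X_syn, y_syn)         # all three features
print(r2_one, r2_all)
```

This guarantee does not extend to held-out data, which is why the comparison in this lab uses the test set.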

 We can make a prediction:

In [None]:
yhat=lr.predict(X_test)
yhat[0:5]

#### About the Authors:  

[Joseph Santarcangelo](https://www.linkedin.com/in/joseph-s-50398b136/) has a PhD in Electrical Engineering. His research focused on using machine learning, signal processing, and computer vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his PhD.