In [None]:
%matplotlib inline

In [None]:
import matplotlib.pyplot as plt 
import seaborn as sns
import numpy as np
import pandas as pd

In [None]:
degrees_that_pay_back = pd.read_csv("../input/degrees-that-pay-back.csv")

### Let's get familiar with the data first:

In [None]:
degrees_that_pay_back.head()

### Observations:

**1.** The entries in all the salary columns contain a dollar sign, which raises the suspicion that they are most probably strings and have to be dealt with first.<br>
**2.** The percentage change column lies in between the other columns, and the mid-career median salary should be properly aligned with the other percentiles, i.e. the columns need some reordering for ease of analysis.<br>
**3.** The column names are very long, so we need to rename them.
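
The first observation can be checked directly instead of guessed from the preview. The snippet below is a minimal sketch on made-up rows (the majors and values are placeholders, not taken from the dataset) showing that a dollar-formatted column is not parsed as a number, while a plain numeric column is.

```python
import io

import pandas as pd

# Made-up rows that mimic the dollar-formatted salary columns (not real data).
raw = io.StringIO(
    'Major,Start,Percent change\n'
    'Major A,"$46,000.00",67.0\n'
    'Major B,"$57,700.00",75.7\n'
)
sample = pd.read_csv(raw)

# True only for the columns that pandas parsed as numbers.
is_numeric = {
    col: pd.api.types.is_numeric_dtype(sample[col]) for col in sample.columns
}
print(is_numeric)
```

Running the same dictionary comprehension on `degrees_that_pay_back` would list which columns still need conversion.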

## Formulating questions:
- Which undergraduate majors have a high starting salary?
- What is the most common percent change in salaries?

# 1. Data Cleaning

### a) Reordering and renaming the columns:

In [None]:
cols = list(degrees_that_pay_back)

cols.insert(1,cols.pop(cols.index('Percent change from Starting to Mid-Career Salary')))
cols.insert(5,cols.pop(cols.index('Mid-Career Median Salary')))

cols

In [None]:
degrees_that_pay_back = degrees_that_pay_back.loc[:, cols]
# Shorter names, listed in the same order as the reordered columns
degrees_that_pay_back.columns = ["Major", 'Percentchange_start_to_mid', 'Start', 'Mid_10th', 'Mid_25th', 'Mid_50th', 'Mid_75th', 'Mid_90th']
degrees_that_pay_back.head(3)

In [None]:
df = degrees_that_pay_back


<font size =3>**Okay, now that the columns have been reordered and renamed, let's take a look at the data info.**</font>
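
The reorder-and-rename step can also be sketched with a single mapping, which keeps each old name next to its new one and is less sensitive to column positions. This is only an illustration on a made-up one-row frame; the names `"Undergraduate Major"` and `"Starting Median Salary"` are assumptions about the original headers, and only four columns are shown.

```python
import pandas as pd

# Made-up single-row frame; two of the long names are assumed, not confirmed.
wide = pd.DataFrame({
    "Undergraduate Major": ["Major A"],            # assumed original name
    "Starting Median Salary": ["$46,000.00"],      # assumed original name
    "Mid-Career Median Salary": ["$77,100.00"],
    "Percent change from Starting to Mid-Career Salary": [67.6],
})

# Keys give the desired column order, values give the short names.
rename_map = {
    "Undergraduate Major": "Major",
    "Percent change from Starting to Mid-Career Salary": "Percentchange_start_to_mid",
    "Starting Median Salary": "Start",
    "Mid-Career Median Salary": "Mid_50th",
}

# Selecting by the keys reorders; rename then shortens the labels.
tidy = wide[list(rename_map)].rename(columns=rename_map)
print(list(tidy.columns))
```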

In [None]:
df.info()

In [None]:
type(df.Start[0])

<br>
<font size =4 >
We can see that only one column is in float format and all the others are strings. Since we were lucky enough to have no missing values, we can move on to the next step.
</font>

### b) Converting the string salary columns to float:

In [None]:
def convert(col):
    # Strip the dollar sign and every thousands separator, then cast to float
    df[col] = df[col].map(lambda x: float(x.replace('$', '').replace(',', '')))


In [None]:
for col in df.columns[2:]:
    convert(col)

In [None]:
df.head(3)

<br>
<font size =4>
We have successfully converted and cleaned the data. Finally, it is time for the real analytical work.
</font>
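
The string-to-float conversion can also be written without a per-value lambda. The sketch below uses pandas string methods on a made-up Series; `to_number` is a hypothetical helper name, not part of the notebook.

```python
import pandas as pd


def to_number(series):
    """Remove '$' and ',' from each string, then parse the result as a number."""
    cleaned = series.str.replace(r"[$,]", "", regex=True)
    return pd.to_numeric(cleaned)


# Made-up values, including one with two thousands separators.
sample = pd.Series(["$46,000.00", "$152,000.00", "$1,250,000.00"])
converted = to_number(sample)
print(converted.tolist())
```

Removing every comma, rather than joining the two pieces around a single comma, keeps the conversion valid for values of any size.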

# 2. Data Analysis

### Let's take a look at the statistics

In [None]:
df.describe()

<br><br>
<font size =4>
We can see that a typical starting salary across the majors is around <u>44310</u>, while the mid-career salary for top performers (90th percentile) is around <u>142766</u>.<br><br>
Let's sort the rows by starting salary, highest first, to see which major has the highest starting salary.
</font>
<br><br>

In [None]:
df.sort_values(by="Start",ascending=False,inplace=True)
df.reset_index(inplace=True)
df.head(4)

<br><br>
<font size =3>
Physician Assistant has the highest starting salary, but even within the top four results the mid-career salary for top performers varies significantly: in the second row, Chemical Engineering steals the spotlight from Physician Assistant.
<br>
<br>
From the statistics above we can also conclude that a higher starting salary does not guarantee the highest pay later on. Salaries in the **90th percentile** can end up far above the starting figure: the **maximum** in the 90th percentile column is **210000**, which is well above what is seen for Physician Assistant.
</font>
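
The comparison above, where one major leads on starting salary and another leads at the 90th percentile, can be sketched with `idxmax`. The frame below uses placeholder majors and invented numbers purely for illustration; with the real data, `df.set_index("Major")` would take its place.

```python
import pandas as pd

# Invented values for three placeholder majors (not taken from the dataset).
toy = pd.DataFrame({
    "Major": ["Major A", "Major B", "Major C"],
    "Start": [74300.0, 63200.0, 45000.0],
    "Mid_90th": [124000.0, 194000.0, 150000.0],
})

# For each salary column, find the major holding the maximum value.
top_per_column = toy.set_index("Major").idxmax()
print(top_per_column)
```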
<br>
<br>

# 3. Visualization

<br>
<font size = 4>
Let's visualize this data to see how much salaries vary across all performers:
</font>
<br>
<br>

In [None]:
f, ax = plt.subplots(figsize=(8, 9))
# Overwrite the old row labels with a descending rank so the top starting salary is drawn at the top
df['index'] = sorted(df['index'], reverse=True)
# One scatter series per salary column, all drawn on the same axes
colors = {"Start": 'g', "Mid_10th": 'c', "Mid_25th": 'y', "Mid_50th": 'b', "Mid_75th": 'm', "Mid_90th": 'r'}
for col, color in colors.items():
    df.plot(kind="scatter", x=col, y="index", ax=ax, color=color, label=col)

ax.set_xlabel("Salary")
ax.set_ylabel("Majors")

### In this scatterplot we can clearly see how the salary distribution becomes more dispersed depending on the percentile you are in and the major you pursue.
<br>
<font size =3>
For a clearer visualization, let's draw a swarmplot of the same data as the scatterplot.
</font>
<br><br>


In [None]:
df1 = df.drop(["index","Percentchange_start_to_mid"],axis=1)

In [None]:
df1 = df1.melt(id_vars="Major",value_vars=['Start','Mid_10th','Mid_25th','Mid_50th','Mid_75th','Mid_90th'],var_name="Category",value_name="Salary")
df1.head()

In [None]:
f,ax = plt.subplots(figsize=(8,7))
sns.swarmplot( data=df1, y="Category", x="Salary", hue="Major",orient="h",ax=ax)
ax.legend_.remove()

### The colors represent majors.

### One surprising thing we observe here is that Start salaries can be higher than the 10th percentile mid-career salaries, which suggests a salary drop from start to mid-career for 10th percentile performers.

### We also observe how the distribution gets thinner and shifts toward higher salaries as the percentile increases, which we also observed in the scatterplot.
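
The first observation above can be backed by a count rather than read off the plot. The sketch below, on invented numbers for placeholder majors, selects the rows where `Start` exceeds `Mid_10th` and reports their share; the same filter should work on `df` once the salary columns are numeric.

```python
import pandas as pd

# Invented values for three placeholder majors (not taken from the dataset).
toy = pd.DataFrame({
    "Major": ["Major A", "Major B", "Major C"],
    "Start": [46000.0, 63200.0, 35000.0],
    "Mid_10th": [42200.0, 71900.0, 36000.0],
})

# Majors whose starting salary is above the mid-career 10th percentile.
dropped = toy[toy["Start"] > toy["Mid_10th"]]
share = len(dropped) / len(toy)

print(dropped["Major"].tolist())
print(f"{share:.0%} of majors")
```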

<font size = 4 >
Let's now try to answer the question: what is the most common percentage salary change from start to mid-career?
</font><br><br>

In [None]:
f,ax = plt.subplots(figsize=(9,6))
sns.distplot(df.Percentchange_start_to_mid,bins=10)

## Here we can see that the most common percentage change in salary from start to mid-career is, surprisingly, in the range of 60 to 70%.
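
Reading the tallest bar off a plot is approximate, so the most common range can also be computed. The sketch below uses `np.histogram` on invented percentages with fixed 10-point-wide bins; the bin edges are an assumption made here and will not necessarily match the automatic edges the plot uses for `bins=10`.

```python
import numpy as np

# Invented percent changes (not taken from the dataset).
changes = np.array([23.4, 55.0, 61.0, 64.5, 66.0, 68.2, 72.0, 88.0, 95.5, 103.5])

# Fixed edges 20, 30, ..., 110 so that every bin is 10 points wide.
edges = np.arange(20, 111, 10)
counts, edges = np.histogram(changes, bins=edges)

# The modal bin is the one with the highest count.
modal = int(np.argmax(counts))
modal_low, modal_high = edges[modal], edges[modal + 1]
print(f"Most common range: {modal_low} to {modal_high} %")
```

With the real data, `df.Percentchange_start_to_mid` would replace the `changes` array.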