### - Importing the necessary Python libraries -

In [None]:
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import numpy as np
import pandas as pd
sns.set_style("darkgrid", {"axes.facecolor": ".9"})

### - Importing data files -

In [None]:
deg = pd.read_csv('../input/college-salaries/degrees-that-pay-back.csv')
sal_col = pd.read_csv('../input/college-salaries/salaries-by-college-type.csv')
sal_reg = pd.read_csv('../input/college-salaries/salaries-by-region.csv')
dataset_list = [deg, sal_col, sal_reg]

### - Looking at the data -

In [None]:
deg.head(5)

In [None]:
sal_col.head(5)

In [None]:
sal_reg.head(5)

### - Finding out salary variable type -

In [None]:
type(deg['Starting Median Salary'][1])

The output above shows that the salary values are stored as strings, so we cannot do calculations with them. Every salary column needs the dollar sign and thousands separator removed and must then be converted to a numerical type.
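As a quick illustration of the same check on a whole column, the sketch below uses a toy frame with made-up values formatted like the raw salary columns (the frame `toy` and its numbers are assumptions for illustration, not part of the dataset).

```python
import pandas as pd

# Toy frame with made-up values, formatted like the raw salary columns.
toy = pd.DataFrame({"Starting Median Salary": ["$46,000.00", "$57,700.00"]})

# Each cell is a Python string, so the column is not numeric.
first_value_is_str = isinstance(toy["Starting Median Salary"].iloc[0], str)
column_is_numeric = pd.api.types.is_numeric_dtype(toy["Starting Median Salary"])

print(first_value_is_str, column_is_numeric)
```

Checking the column with `is_numeric_dtype` avoids relying on a single row being representative.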

### - Fixing the salary variable type -

In [None]:
selected_col = ['Starting Median Salary','Mid-Career Median Salary','Mid-Career 10th Percentile Salary','Mid-Career 25th Percentile Salary','Mid-Career 75th Percentile Salary','Mid-Career 90th Percentile Salary']

for dataset in dataset_list:
    for col in selected_col:
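        # regex=False treats "$" as a literal character, not the regex end-of-string anchor
        dataset[col] = dataset[col].str.replace("$", "", regex=False)
        dataset[col] = dataset[col].str.replace(",", "", regex=False)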
        dataset[col] = pd.to_numeric(dataset[col])
sal_col.head()

Because all three datasets sit in one list and the salary columns are named in a second list, the same cleaning code runs over every value that needs changing. For each column, it removes the dollar sign and the comma from the strings and then converts the result to numerical values.
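The cleaning step can also be wrapped in a small helper so it is easy to try on one column. This is a minimal sketch: the function name `clean_salary` and the toy values are hypothetical, chosen only to mirror the format of the raw data.

```python
import pandas as pd

def clean_salary(series):
    """Strip '$' and ',' from a string column and convert it to numbers."""
    stripped = (
        series.str.replace("$", "", regex=False)  # literal dollar sign
              .str.replace(",", "", regex=False)  # thousands separator
    )
    return pd.to_numeric(stripped)

# Toy frame with made-up values, formatted like the raw salary columns.
toy = pd.DataFrame({"Starting Median Salary": ["$46,000.00", "$57,700.00"]})
toy["Starting Median Salary"] = clean_salary(toy["Starting Median Salary"])
print(toy.dtypes)
```

Passing `regex=False` matters here because `$` has a special meaning in regular expressions.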

### - Getting a general idea -

In [None]:
deg.describe()

### - Starting median salary by undergraduate major -

In [None]:
deg = deg.sort_values("Starting Median Salary", ascending=False).reset_index(drop=True)
f, ax = plt.subplots(figsize=(15, 20))
# Draw on the axes created above; seaborn labels each bar from the y column
figure1 = sns.barplot(y='Undergraduate Major', x='Starting Median Salary', data=deg, palette="twilight", ax=ax)
plt.show()

### - Comparison of starting and mid-career salaries of top 10 careers -

In [None]:
deg_2 = deg[deg['Undergraduate Major'].isin(['Physician Assistant', 'Chemical Engineering', 'Computer Engineering', 'Electrical Engineering', 'Mechanical Engineering', 'Aerospace Engineering',
       'Industrial Engineering', 'Computer Science', 'Nursing',
       'Civil Engineering'])].reset_index(drop=True)
deg_2.head()

In [None]:
data = []

# Reshape to long format: one row per (major, salary type) pair
for idx in deg_2.index:
    major, starting, mid_career = deg_2.loc[idx, ['Undergraduate Major', 'Starting Median Salary', 'Mid-Career Median Salary']].tolist()
    data.append([major, starting, 'Starting'])
    data.append([major, mid_career, 'Mid-Career'])

df_mod = pd.DataFrame(data, columns=['Undergraduate Major', 'Salary', 'Type'])
df_mod
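The wide-to-long reshaping done by the loop above can also be expressed with `DataFrame.melt`. The sketch below is an alternative, not the notebook's method, and it runs on a toy frame whose major names and salaries are made up for illustration.

```python
import pandas as pd

# Toy wide-format frame; names and numbers are placeholders, not real data.
wide = pd.DataFrame({
    "Undergraduate Major": ["Major A", "Major B"],
    "Starting Median Salary": [50000.0, 60000.0],
    "Mid-Career Median Salary": [90000.0, 100000.0],
})

# melt stacks the two salary columns into one 'Salary' column and
# records which column each value came from in 'Type'.
long = wide.melt(
    id_vars="Undergraduate Major",
    value_vars=["Starting Median Salary", "Mid-Career Median Salary"],
    var_name="Type",
    value_name="Salary",
)
print(long)
```

The resulting frame has one row per major and salary type, which is the shape seaborn expects when a `hue` column separates the bars.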

In [None]:
f, ax = plt.subplots(figsize=(25, 12))
# hue='Type' draws a separate bar for each salary type instead of averaging them
sns.barplot(x='Undergraduate Major', y='Salary', hue='Type', data=df_mod, palette="twilight", ax=ax)
ax.legend(loc='upper right')

### - Disciplines with the fastest-growing salary throughout the career -

In [None]:
salary_change = deg.sort_values('Percent change from Starting to Mid-Career Salary', ascending=False).head(10)
f, ax = plt.subplots(figsize=(15, 10))
figure2 = sns.barplot(x='Undergraduate Major', y='Percent change from Starting to Mid-Career Salary', data=salary_change, palette="twilight", ax=ax)
# Rotate the major names after plotting so they stay matched to the drawn bars
ax.tick_params(axis='x', labelrotation=90)
plt.show()

While engineering disciplines lead in starting salary, philosophy and math show the biggest jump from starting to mid-career salary, followed by international relations, economics, and marketing.
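For reference, a percent change of this kind can be computed from the two median salary columns. The sketch below assumes the usual definition, (mid-career − starting) / starting × 100; the dataset does not document its formula, so treat that as an assumption. The toy values and the column name `pct_change` are made up.

```python
import pandas as pd

# Toy frame with made-up salaries.
toy = pd.DataFrame({
    "Starting Median Salary": [40000.0, 60000.0],
    "Mid-Career Median Salary": [80000.0, 90000.0],
})

# Assumed definition: relative growth from starting to mid-career, in percent.
toy["pct_change"] = (
    (toy["Mid-Career Median Salary"] - toy["Starting Median Salary"])
    / toy["Starting Median Salary"] * 100
)
print(toy)
```

Under this definition, a major with a low starting salary can rank high on growth even if its mid-career salary is modest in absolute terms.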

### - Which regions offer the highest salaries -

In [None]:
sns.set_palette("twilight")

In [None]:
region_figure = sal_reg[['Starting Median Salary','Mid-Career Median Salary','Mid-Career 75th Percentile Salary','Region']]
var = region_figure.groupby('Region').mean()
var.plot(kind='line', figsize=(15, 6));

California and the Northeastern region offer higher salaries than the other regions in the US. Interestingly, the growth from the starting median salary to the mid-career 75th percentile salary is consistent across all regions: regions that start off higher also finish higher.
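One way to check the "start higher, finish higher" observation numerically is to compare how regions rank on each column. This is a rough sketch on a toy frame; the region labels `R1` to `R3` and all values are placeholders, not figures from the dataset.

```python
import pandas as pd

# Toy frame with placeholder regions and made-up salaries.
toy = pd.DataFrame({
    "Region": ["R1", "R1", "R2", "R2", "R3", "R3"],
    "Starting Median Salary": [50000, 52000, 44000, 46000, 40000, 42000],
    "Mid-Career 75th Percentile Salary": [120000, 124000, 100000, 104000, 90000, 94000],
})

means = toy.groupby("Region").mean(numeric_only=True)

# If both columns rank the regions identically, the ordering is preserved.
start_rank = means["Starting Median Salary"].rank()
late_rank = means["Mid-Career 75th Percentile Salary"].rank()
same_order = bool((start_rank == late_rank).all())
print(same_order)
```

Applied to `sal_reg`, the same comparison would show whether the visual impression from the line plot holds for every region.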

### - Salary comparison by school type -

In [None]:
# numeric_only=True skips text columns such as the school name when averaging
uni_group = sal_col.groupby("School Type").mean(numeric_only=True).sort_values(by="Starting Median Salary", ascending=False)
f,ax = plt.subplots(figsize=(15,7))
uni_group.plot(ax=ax)
ax.set_ylabel("Salary")
ax.set_xlabel("Type of University")

By school type, Ivy League schools clearly lead with the highest salaries throughout the career, except at the 10th percentile. Engineering schools come second, although Liberal Arts schools surpass them at the 90th percentile. Party and State schools have significantly lower salaries throughout.