### Overview

In this analysis I investigate where it is best to start a data science career: where a junior data scientist is most likely to find a good job with a market-level salary, which location they should move to, and which companies they should apply to.

### Data analysis

Before looking for the most suitable job, we need to clean the data. I've already done that in [EDA for Data scientist job](http://www.kaggle.com/efimovadaria/eda-for-data-scientist-job), so I will simply reuse the data cleaning and salary parsing code from there:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import shapiro
from scipy.stats import anderson
from scipy.stats import normaltest
from scipy.stats import norm
import seaborn as sns
import numpy as np
from sklearn.preprocessing import StandardScaler
from scipy import stats
import re
import warnings
from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype
from wordcloud import WordCloud, STOPWORDS 
warnings.filterwarnings('ignore')
%matplotlib inline

data = pd.read_csv('../input/data-scientist-jobs/DataScientist.csv')
data.head()

# Drop columns that are not used in the analysis
data = data.drop(columns=['Unnamed: 0', 'index', 'Competitors', 'Easy Apply'])

# Numeric -1 marks a missing value; fill gaps in Rating by linear interpolation
data = data.replace(-1, np.nan)
data["Rating"] = data["Rating"].interpolate(method='linear', limit_direction='forward')

data.drop(data[data['Headquarters'] == "-1"].index, inplace=True)
data.drop(data[data['Size'].str.contains("-1")].index, inplace=True)
data.drop(data[data['Type of ownership'].str.contains("-1")].index, inplace=True)
data.drop(data[data['Revenue'].str.contains("-1")].index, inplace=True)
data.drop(data[data['Sector'].str.contains("-1")].index, inplace=True)
data.drop(data[data['Industry'].str.contains("-1")].index, inplace=True)

In [None]:
HOURS_PER_WEEK = 40
WEEKS_PER_YEAR = 52
THOUSAND = 1000

def return_digits(x):
    result = re.findall(r'\d+', str(x))
    result = int(result[0]) if result else 0
    return result

def return_salary(string, isFrom):
    """Parse the lower (isFrom=True) or upper bound of a salary range, in dollars per year."""
    string = str(string)  # guard against non-string values such as NaN
    if isFrom:
        patternMain = r'^\$\d+K'
        patternPerHour = r'^\$\d+'
    else:
        patternMain = r'-\$\d+K'
        patternPerHour = r'-\$\d+'

    if 'Per Hour' in string:
        # Hourly rate: convert to an annual figure
        result = re.findall(patternPerHour, string)
        result = return_digits(result[0]) if result else 0
        result = result * HOURS_PER_WEEK * WEEKS_PER_YEAR
    else:
        # Annual figure quoted in thousands, e.g. "$111K"
        result = re.findall(patternMain, string)
        result = return_digits(result[0]) if result else 0
        result = result * THOUSAND
    return result

def return_average_salary(x):
    from_salary = return_salary(x, True)
    to_salary = return_salary(x, False)
    result = (from_salary+to_salary)/2
    return result

data['SalaryAverage'] =  data['Salary Estimate'].apply(return_average_salary)
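
As a quick sanity check of the parsing logic above, the sketch below applies a compact variant of it to two sample strings. The helper name `parse_salary_range` and the sample strings are illustrative assumptions written in the style of the "Salary Estimate" column; they are not rows taken from the dataset.

```python
import re

HOURS_PER_WEEK = 40
WEEKS_PER_YEAR = 52

def parse_salary_range(text):
    """Return (low, high) annual salary in dollars parsed from a salary string."""
    text = str(text)
    # Grab every dollar amount in order, e.g. "$111K-$181K" -> [111, 181]
    amounts = [int(n) for n in re.findall(r'\$(\d+)', text)]
    if len(amounts) < 2:
        return (0, 0)  # nothing parseable, same fallback as the code above
    low, high = amounts[0], amounts[1]
    if 'Per Hour' in text:
        factor = HOURS_PER_WEEK * WEEKS_PER_YEAR  # hourly rate -> annual
    else:
        factor = 1000                             # "K" -> dollars
    return (low * factor, high * factor)

# Hypothetical sample strings (assumed format)
samples = ["$111K-$181K (Glassdoor est.)", "$34-$53 Per Hour (Glassdoor est.)"]
averages = [sum(parse_salary_range(s)) / 2 for s in samples]
print(averages)
```

The hourly case is converted with 40 hours per week and 52 weeks per year, the same constants used in the cell above.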

In [None]:
print(data.shape)
print(data.columns)

def count_missing_values():
    for column in data:
        nullAmount = None
        if (is_numeric_dtype(data[column])):
            nullAmount = data[data[column] == -1].shape[0]
        else:
            nullAmount = data[data[column] == "-1"].shape[0]
        print('{}{},  \t{:2.1f}%'.format(column.ljust(20),nullAmount, nullAmount*100/data[column].shape[0]))
    
count_missing_values()

So now the unnecessary columns are gone and no `-1` placeholders remain in the data.
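
Because numeric `-1` values were replaced with `NaN` during cleaning, a check that compares against `-1` can no longer see them. A minimal sketch of counting both kinds of placeholder at once is shown below, on a small made-up frame (the values are hypothetical, not rows from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame: a numeric column with NaN and a text column with the "-1" placeholder
toy = pd.DataFrame({
    "Rating": [3.5, np.nan, 4.1, 3.9],
    "Sector": ["Information Technology", "-1", "Finance", "-1"],
})

# Treat NaN and the string "-1" both as missing
missing_mask = toy.isna() | toy.eq("-1")
missing_counts = missing_mask.sum()
missing_share = (missing_mask.mean() * 100).round(1)

print(pd.DataFrame({"missing": missing_counts, "percent": missing_share}))
```

Running the same two lines on `data` would show whether any column still contains `NaN` after cleaning.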

To find the best place to start a data science career, I need to take the seniority level of each position into account. I will split the data by level (Junior, Senior, and so on) and continue the analysis with Junior positions.

In [None]:
# Note: str.contains treats the pattern as a regex, so "." in "Sr." matches any character
seniorData = data[data['Job Title'].str.contains("Senior") | data['Job Title'].str.contains("Sr.")]
print(len(seniorData))

In [None]:
# Note: str.contains treats the pattern as a regex, so "." in "Jr." matches any character
juniorData = data[data['Job Title'].str.contains("Junior") | data['Job Title'].str.contains("Jr.")]
print(len(juniorData))

We have 40 observations for Junior data science positions and 596 for Senior ones.
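
One caveat about the title filters: `str.contains` interprets its pattern as a regular expression by default, so the `.` in `"Jr."` or `"Sr."` matches any character, not just a literal dot. A small sketch with made-up job titles (hypothetical, not from the dataset) shows how the regex match and the literal match differ:

```python
import pandas as pd

# Hypothetical job titles for illustration
titles = pd.Series([
    "Jr. Data Scientist",
    "Data Scientist Jr",       # nothing follows "Jr"
    "Jr Data Analyst",         # "Jr" followed by a space
    "Senior Data Scientist",
])

# Default behaviour: the pattern is a regex, "." matches any single character
regex_hits = titles.str.contains("Jr.")
# regex=False: the pattern is matched as the literal text "Jr."
literal_hits = titles.str.contains("Jr.", regex=False)

print(regex_hits.tolist())
print(literal_hits.tolist())
```

Neither variant catches a title that ends in a bare "Jr", so the counts depend on which convention the postings use.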

We will continue with juniorData. Let's take a look at plots:

In [None]:
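# distplot is deprecated in recent seaborn; histplot plus an explicit normal fit is a close equivalent
ax = sns.histplot(juniorData['SalaryAverage'], kde=True, stat='density')
x = np.linspace(*ax.get_xlim(), 200)
ax.plot(x, norm.pdf(x, *norm.fit(juniorData['SalaryAverage'])), color='black')
fig = plt.figure()
res = stats.probplot(juniorData['SalaryAverage'], plot=plt)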

The average salary looks approximately bell-shaped, so it may be normally distributed. Let's check the boxplot for outliers.

In [None]:
juniorData.boxplot(column=['SalaryAverage'])

There don't appear to be many outliers, so as a next step we can check normality with the Shapiro-Wilk test.

In [None]:
stat, p = shapiro(juniorData['SalaryAverage'])
print('Statistics=%.3f, p=%.3f' % (stat, p))

Let's set the significance level to 0.05. The p-value is 0.12, and 0.12 > 0.05, so we don't have enough evidence to reject the null hypothesis of normality. We will therefore treat the average salary for junior positions as approximately normally distributed.
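
A p-value above the threshold only means the test found no evidence against normality; it does not prove that the data are normal, especially with a small sample. The sketch below illustrates how the decision rule can be phrased, using synthetic data generated with NumPy (the samples and the helper name `normality_verdict` are assumptions for illustration, not part of the original analysis):

```python
import numpy as np
from scipy.stats import shapiro

ALPHA = 0.05

def normality_verdict(sample, alpha=ALPHA):
    """Run Shapiro-Wilk and phrase the outcome without overclaiming."""
    stat, p = shapiro(sample)
    if p < alpha:
        verdict = "reject normality"
    else:
        verdict = "no evidence against normality"
    return stat, p, verdict

rng = np.random.default_rng(0)
# Synthetic salaries: 40 draws from a normal distribution (same size as the junior sample)
normal_like = rng.normal(loc=100_000, scale=30_000, size=40)
# Synthetic, strongly right-skewed values for contrast
skewed = rng.exponential(scale=100_000, size=500)

for name, sample in [("normal_like", normal_like), ("skewed", skewed)]:
    stat, p, verdict = normality_verdict(sample)
    print(f"{name}: W={stat:.3f}, p={p:.3g} -> {verdict}")
```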


Now we can calculate an expected value for the salary of junior data science position:

In [None]:
juniorData["SalaryAverage"].mean()

We can see that the sample mean of our data is \\$102,550, which is close to \\$100,000. Let's test whether the true expected salary differs significantly from \\$100,000. A one-sample t-test is appropriate here, because the Shapiro-Wilk test did not reject normality and the boxplot showed few outliers.
We will also build a 95% confidence interval for this value.

In [None]:
import statsmodels.stats.api as sms

print(stats.ttest_1samp(juniorData['SalaryAverage'], popmean=100000))

bounds = sms.DescrStatsW(juniorData['SalaryAverage']).tconfint_mean()
print(bounds)

Here the p-value is 0.62 > 0.05, so we fail to reject the null hypothesis: the data are consistent with an expected salary of \\$100,000 for a junior data scientist job. We can also say with 95% confidence that the true expected salary for a junior data scientist position lies between \\$92,077 and \\$113,022.
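
For reference, the one-sample t statistic is $t = (\bar{x} - \mu_0) / (s / \sqrt{n})$, and the 95% confidence interval is $\bar{x} \pm t_{0.975,\,n-1} \cdot s / \sqrt{n}$. The sketch below recomputes both by hand and compares them with `scipy.stats.ttest_1samp`. It runs on synthetic data drawn with NumPy, so the printed numbers will not match the ones reported here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic sample (assumed parameters), standing in for juniorData['SalaryAverage']
sample = rng.normal(loc=102_000, scale=32_000, size=40)
mu0 = 100_000  # hypothesised mean

n = sample.size
mean = sample.mean()
sem = sample.std(ddof=1) / np.sqrt(n)          # standard error of the mean

t_manual = (mean - mu0) / sem
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)   # two-sided p-value

t_crit = stats.t.ppf(0.975, df=n - 1)          # critical value for a 95% interval
ci = (mean - t_crit * sem, mean + t_crit * sem)

t_scipy, p_scipy = stats.ttest_1samp(sample, popmean=mu0)
print(f"manual: t={t_manual:.3f}, p={p_manual:.3f}")
print(f"scipy:  t={t_scipy:.3f}, p={p_scipy:.3f}")
print(f"95% CI: ({ci[0]:.0f}, {ci[1]:.0f})")
```

The two views agree by construction: the two-sided test fails to reject at the 5% level exactly when the hypothesised mean lies inside the 95% interval.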

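Now let's look at the locations where we are most likely to find a job to start our data science career. I keep only the positions whose average salary falls inside the confidence interval above and treat them as market-level offers.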

In [None]:
bestData = juniorData[(juniorData['SalaryAverage']>92077) & (juniorData['SalaryAverage']<113022)]
print(bestData.shape)

Let's look at some plots:

In [None]:
print(sns.countplot(y='Company Name',data=bestData, order = bestData['Company Name'].value_counts().index))

In [None]:
companyData = bestData[bestData['Company Name'].str.contains("Staffigo")]
print(companyData.shape)
print(sns.catplot(x="Location", y="SalaryAverage", hue = "Job Title", s = 20, data=companyData, aspect=1.5))

In [None]:
print(sns.countplot(y='Location',data=bestData, order = bestData['Location'].value_counts().index))

In [None]:
print(sns.catplot(x="Size", y="SalaryAverage", hue = "Sector", s = 10, data=bestData))

In [None]:
print(sns.countplot(y='Sector',data=bestData, order = bestData['Sector'].value_counts().index))

The plots above show that if you want to start a data science career in the USA, most junior-level positions with a market-level salary are in Austin, TX and Chicago, IL. Most of the companies are fairly small (51-200 employees) and belong to the IT sector.
You could also apply to "Staffigo Technical Services, LLC", which has more junior data science positions than any other company, spread across different US cities.
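
The counts behind these plots can also be read off directly as tables, which is handy when bars are close in height. A minimal sketch with `value_counts`, on made-up rows that reuse the dataset's column names (the rows themselves are hypothetical):

```python
import pandas as pd

# Hypothetical rows with the same column names as bestData
toy = pd.DataFrame({
    "Location": ["Austin, TX", "Chicago, IL", "Austin, TX",
                 "New York, NY", "Chicago, IL", "Austin, TX"],
    "Size": ["51-200 employees", "51-200 employees", "1-50 employees",
             "51-200 employees", "10000+ employees", "51-200 employees"],
    "Sector": ["Information Technology"] * 4 + ["Finance", "Information Technology"],
})

# Top three values per column, most frequent first
top_by_column = {col: toy[col].value_counts().head(3) for col in toy.columns}
for col, counts in top_by_column.items():
    print(f"\nTop values for {col}:")
    print(counts)
```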

Of course, it also helps to know which skills companies expect, so let's look at a word cloud built from the "Job Description" column of our dataset:

In [None]:
stopwords = set(STOPWORDS) 
wordcloud = WordCloud(width = 500, height = 500, 
                background_color ='white', 
                stopwords = stopwords, 
                min_font_size = 10).generate(' '.join(bestData["Job Description"])) 
                         
plt.figure(figsize = (10, 10), facecolor = None) 
plt.imshow(wordcloud) 
plt.axis("off") 
  
plt.tight_layout(pad = 0) 
plt.show() 
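
A word cloud gives a quick impression but is hard to read precisely, so a plain frequency table can complement it. The sketch below counts terms with `collections.Counter`; the sample descriptions, the small stop-word set, and the helper name `top_terms` are all assumptions for illustration.

```python
import re
from collections import Counter

# Hypothetical job descriptions (not from the dataset)
descriptions = [
    "Experience with Python and SQL required. Machine learning experience is a plus.",
    "Strong SQL skills, Python, and statistics. Experience with machine learning models.",
]
# Tiny hand-picked stop-word list for the example
stop_words = {"with", "and", "is", "a", "required", "strong", "plus"}

def top_terms(texts, stop_words, k=5):
    """Return the k most frequent lowercase words that are not stop words."""
    tokens = re.findall(r"[a-z]+", " ".join(texts).lower())
    counts = Counter(t for t in tokens if t not in stop_words)
    return counts.most_common(k)

top = top_terms(descriptions, stop_words)
print(top)
```

On the real data, `bestData["Job Description"]` could be passed in place of `descriptions`, and the `STOPWORDS` set from the `wordcloud` package could serve as the stop-word list.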