# Introduction 

I previously answered [this question](https://www.kaggle.com/andrewmvd/data-analyst-jobs/tasks?taskId=1586) in [this notebook](https://www.kaggle.com/erickdcohen/is-there-a-correlation-between-rating-and-salary), which was written in R. The post author asked for the same analysis in Python, so I am recreating the notebook here. I have coded in R for most of my career, but I wanted to show that the analysis can be reproduced in Python as well. The final values differ slightly from those in my R notebook because that notebook included additional feature engineering that is not repeated here. The concepts are all the same.

Let's get started and tackle the original question.

# The Original Question 

There was a [task](https://www.kaggle.com/andrewmvd/data-analyst-jobs/tasks?taskId=1586) that asked if there was a correlation between *salary* and *rating*. 

# The Short Answer - And a New Question

You can always calculate the correlation coefficient for two variables; **the better question is: is the correlation between *salary* and *rating* statistically significant?** If you are not sure what statistical significance is, I encourage you to read about it! Understanding the statistics behind models will make you a better data scientist.
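
As a minimal sketch of the difference between computing a correlation and testing its significance, the snippet below uses `scipy.stats.pearsonr` on made-up data. The `rating` and `salary` arrays are hypothetical and generated independently of each other; they are not the Kaggle dataset, and `scipy` is an assumption since it is not imported elsewhere in this notebook.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical data: 200 ratings and salaries generated independently,
# so any correlation in the sample is due to chance alone.
rating = rng.uniform(1.0, 5.0, size=200)
salary = rng.normal(70_000, 15_000, size=200)

# pearsonr returns the sample correlation coefficient and the two-sided
# p-value for the null hypothesis that the true correlation is zero.
r, p_value = stats.pearsonr(rating, salary)

print(f"r = {r:.3f}, p-value = {p_value:.3f}")
print("significant at 0.05" if p_value < 0.05 else "not significant at 0.05")
```

The coefficient `r` is always defined, but only the p-value says whether a value that far from zero would be surprising if the two variables were truly unrelated.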


**Before we can answer the question, we must do some data preprocessing and think about the dataset. I would caution that the results of this analysis apply only to this dataset, which is not representative of jobs or salaries as a whole, and, more importantly, stress that CORRELATION IS NOT CAUSATION!!!**

**So please interpret the results with caution.**

Nevertheless, let's do some data preprocessing:

In [None]:
!pip install pyjanitor

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import os
from janitor import clean_names
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Let's read in the dataset and inspect the first few rows.

In [None]:
analystJobsRaw = pd.read_csv("/kaggle/input/data-analyst-jobs/DataAnalyst.csv").clean_names()

analystJobsRaw.head()

# Feature Engineering

Because the `salary_estimate` series only provides a range, we must feature engineer the average salary from the range. We will take advantage of *regular expressions* to extract the numbers we need. 
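
To make the extraction concrete, here is a small sketch on a few hypothetical strings. The `$<low>K-$<high>K (Glassdoor est.)` format and the `-1` placeholder for a missing estimate are assumptions about how the column looks, not values copied from the dataset.

```python
import pandas as pd

# Hypothetical examples in the assumed "$<low>K-$<high>K (Glassdoor est.)" format.
sample = pd.Series([
    "$37K-$66K (Glassdoor est.)",
    "$110K-$190K (Glassdoor est.)",
    "-1",  # assumed placeholder for a missing estimate
])

# Two capture groups: the digits after the first "$" and after the second "$".
bounds = sample.str.extract(r"\$(\d+)K-\$(\d+)K").astype(float) * 1000
bounds.columns = ["lower_bound_salary", "upper_bound_salary"]

# Row-wise mean of the two bounds; rows that did not match stay NaN.
bounds["average_salary"] = bounds.mean(axis=1)

print(bounds)
```

Matching `\d+` rather than a fixed two digits keeps values such as `$110K` intact, and rows that do not match the pattern come back as `NaN`, which `dropna()` can then remove.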

In [None]:
# Extract the salary estimate and rating from the raw df (copy to avoid SettingWithCopyWarning)
analystDf = analystJobsRaw[['salary_estimate', 'rating']].copy()

# Feature engineer the average salary
# \d+ (rather than a fixed two digits) also handles three-digit values such as "$110K"
analystDf['lower_bound_salary'] = analystDf['salary_estimate'].str.extract(r'\$(\d+)K', expand=False).astype('float') * 1000 # Make lower bound salary
analystDf['upper_bound_salary'] = analystDf['salary_estimate'].str.extract(r'(\d+)(?=K\s\()', expand=False).astype('float') * 1000 # Make upper bound salary
analystDf['average_salary'] = (analystDf['lower_bound_salary'] + analystDf['upper_bound_salary']) / 2

analystDf = analystDf.dropna()


analystDf.head(10)

# Quick plot 

A quick scatter plot of our data suggests there is no strong linear relationship. But let's proceed!

In [None]:
plt.scatter(analystDf['average_salary'], analystDf['rating'])
plt.xlabel("salary")
plt.ylabel("rating")

# Model

Now that we have our average salary, let's create a linear regression and calculate the R squared value (the coefficient of determination) for `average_salary` explained by `rating`.

In [None]:
print(analystDf['average_salary'].describe())
print(analystDf['rating'].describe())


print("Are there missing or Null data: " + str(analystDf.isnull().values.any())) 

In [None]:
x = analystDf[['rating']].values # predictor as a 2D array of shape (n, 1), as LinearRegression expects
y = analystDf['average_salary'].values # response as a 1D array

# Performing the Linear Regression
salaryModel = LinearRegression().fit(x, y)

# Printing R squared (the coefficient of determination, not the correlation coefficient itself)
print('The R squared value for salary explained by rating is: ' + str(salaryModel.score(x, y)))

As we can see from the R squared value above, the result is basically zero. Despite this, it is still a good idea to take a look at the statistical outputs from our model. Getting into the habit of looking at the regression output is important for grasping the statistics behind our models.
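
For a regression with a single predictor and an intercept, the R squared reported by `score()` is the square of the Pearson correlation coefficient. The sketch below checks this on hypothetical data (the arrays `x` and `y` are made up for illustration and are not the salary data).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Hypothetical data with a weak linear relationship plus noise.
x = rng.uniform(1.0, 5.0, size=500)
y = 2.0 * x + rng.normal(0.0, 10.0, size=500)

# R squared as reported by scikit-learn's score().
X = x.reshape(-1, 1)
r_squared = LinearRegression().fit(X, y).score(X, y)

# Pearson correlation coefficient from the 2x2 correlation matrix.
r = np.corrcoef(x, y)[0, 1]

print(f"R squared = {r_squared:.4f}")
print(f"r         = {r:.4f}")
print(f"r squared = {r**2:.4f}")
```

Because of this identity, an R squared near zero implies a correlation coefficient near zero as well, and the value is the same whichever of the two variables is treated as the predictor.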

In [None]:
x_val = sm.add_constant(analystDf['rating'])
est = sm.OLS(analystDf['average_salary'], x_val)
est2 = est.fit()
print(est2.summary())

# Conclusion 

As we can see from the near-zero R squared value above, the overall F-statistic and its large p-value, and the other statistics in our output, the relationship between salary and rating in this dataset is not statistically significant. Again, it is important to stress that **correlation is NOT causation** and that the results of these tests can only be used to make judgements about this specific dataset.

I hope this was helpful! 
