# Data Analysis with Python

1. Importing Libraries
To perform data analysis in Python, first import the libraries you need. The most commonly used are pandas, NumPy, Matplotlib, and seaborn.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns



2. Loading the Dataset
Download the dataset from an online source and load it into a pandas DataFrame. pandas provides separate readers for common formats such as CSV (read_csv), Excel (read_excel), and SQL (read_sql).


In [10]:
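# Raw string (r'...') keeps the backslashes in the Windows path from being read as escape sequences
df = pd.read_csv(r'E:\Degree\data science\projects\data analysis with python\data science salary\df.csv')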
df.head()

,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2023,SE,FT,Principal Data Scientist,80000,EUR,85847,ES,100,ES,L
1,2023,MI,CT,ML Engineer,30000,USD,30000,US,100,US,S
2,2023,MI,CT,ML Engineer,25500,USD,25500,US,100,US,S
3,2023,SE,FT,Data Scientist,175000,USD,175000,CA,100,CA,M
4,2023,SE,FT,Data Scientist,120000,USD,120000,CA,100,CA,M
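
The other readers follow the same pattern as read_csv. The sketch below is only an illustration: the CSV text, the table name salaries, and the in-memory SQLite database are made-up stand-ins for a real file and a real database. Reading .xlsx files with read_excel works the same way but needs the openpyxl package installed.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = "job_title,salary_in_usd\nData Scientist,175000\nML Engineer,30000\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# Hypothetical in-memory SQLite database standing in for a real SQL source.
conn = sqlite3.connect(":memory:")
df_csv.to_sql("salaries", conn, index=False)
df_sql = pd.read_sql("SELECT * FROM salaries WHERE salary_in_usd > 100000", conn)
conn.close()

print(df_csv.shape)
print(df_sql)
```

Every reader returns an ordinary DataFrame, so the rest of the analysis does not depend on where the data came from.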


3. Exploring the Dataset
After loading the dataset, you can perform various operations to explore and understand the data.

In [13]:
# Get the column names
df.columns

# Export the DataFrame to different file formats
df.to_csv('data.csv')
df.to_json('data.json')
df.to_excel('data.xlsx')


In [15]:

# Check the data types of columns
df.dtypes


work_year              int64
experience_level      object
employment_type       object
job_title             object
salary                 int64
salary_currency       object
salary_in_usd          int64
employee_residence    object
remote_ratio           int64
company_location      object
company_size          object
dtype: object

In [19]:

# Convert a column to the category dtype (stored here in a new column, company_loc)
df['company_loc'] = df['company_location'].astype('category')
df.head()


,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size,company_loc
0,2023,SE,FT,Principal Data Scientist,80000,EUR,85847,ES,100,ES,L,ES
1,2023,MI,CT,ML Engineer,30000,USD,30000,US,100,US,S,US
2,2023,MI,CT,ML Engineer,25500,USD,25500,US,100,US,S,US
3,2023,SE,FT,Data Scientist,175000,USD,175000,CA,100,CA,M,CA
4,2023,SE,FT,Data Scientist,120000,USD,120000,CA,100,CA,M,CA
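
One reason to use the category dtype is memory: each distinct value is stored once and the rows hold small integer codes. The sketch below shows this on a made-up Series of country codes, not on the salary dataset.

```python
import pandas as pd

# Hypothetical column with a few repeated country codes.
codes = pd.Series(["US", "CA", "US", "ES"] * 1000)
as_category = codes.astype("category")

# Compare the memory footprint of the two representations (in bytes).
print(codes.memory_usage(deep=True), as_category.memory_usage(deep=True))

# The distinct values are kept once, in sorted order.
print(as_category.cat.categories.tolist())
```

The saving grows with the number of rows and shrinks as the number of distinct values approaches the number of rows.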


In [20]:
# Check the summary statistics of the DataFrame
df.describe()



,work_year,salary,salary_in_usd,remote_ratio
count,3755.0,3755.0,3755.0,3755.0
mean,2022.373635,190695.6,137570.38988,46.271638
std,0.691448,671676.5,63055.625278,48.58905
min,2020.0,6000.0,5132.0,0.0
25%,2022.0,100000.0,95000.0,0.0
50%,2022.0,138000.0,135000.0,0.0
75%,2023.0,180000.0,175000.0,100.0
max,2023.0,30400000.0,450000.0,100.0


In [21]:
# Get concise summary information of the DataFrame
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3755 entries, 0 to 3754
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   work_year           3755 non-null   int64   
 1   experience_level    3755 non-null   object  
 2   employment_type     3755 non-null   object  
 3   job_title           3755 non-null   object  
 4   salary              3755 non-null   int64   
 5   salary_currency     3755 non-null   object  
 6   salary_in_usd       3755 non-null   int64   
 7   employee_residence  3755 non-null   object  
 8   remote_ratio        3755 non-null   int64   
 9   company_location    3755 non-null   category
 10  company_size        3755 non-null   object  
 11  company_loc         3755 non-null   category
dtypes: category(2), int64(4), object(6)
memory usage: 306.1+ KB


4. Data Cleaning and Preprocessing
Data cleaning involves handling missing values, replacing values, normalizing data, and creating dummy variables.


In [24]:
# Handling missing values (both calls return a new DataFrame; df itself is unchanged)
df.dropna()  # Drop rows that contain missing values
df.replace(np.nan, 0)  # Replace missing values (NaN) with 0


,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size,company_loc
0,2023,SE,FT,Principal Data Scientist,80000,EUR,85847,ES,100,ES,L,ES
1,2023,MI,CT,ML Engineer,30000,USD,30000,US,100,US,S,US
2,2023,MI,CT,ML Engineer,25500,USD,25500,US,100,US,S,US
3,2023,SE,FT,Data Scientist,175000,USD,175000,CA,100,CA,M,CA
4,2023,SE,FT,Data Scientist,120000,USD,120000,CA,100,CA,M,CA
...,...,...,...,...,...,...,...,...,...,...,...,...
3750,2020,SE,FT,Data Scientist,412000,USD,412000,US,100,US,L,US
3751,2021,MI,FT,Principal Data Scientist,151000,USD,151000,US,100,US,L,US
3752,2020,EN,FT,Data Scientist,105000,USD,105000,US,100,US,S,US
3753,2020,EN,CT,Business Data Analyst,100000,USD,100000,US,100,US,L,US
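
Neither dropna nor replace changes the DataFrame in place, so the result has to be assigned to a name to be kept. The following is a minimal sketch on a made-up three-row frame; filling with the column median is just one possible choice, not something the dataset requires.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing salary.
toy = pd.DataFrame({
    "job_title": ["Data Scientist", "ML Engineer", "Analyst"],
    "salary_in_usd": [175000.0, np.nan, 100000.0],
})

# dropna returns a new frame without the incomplete row.
dropped = toy.dropna()

# fillna with a dict fills each listed column with its own value.
filled = toy.fillna({"salary_in_usd": toy["salary_in_usd"].median()})

print(toy["salary_in_usd"].isna().sum())  # the original still has its NaN
print(len(dropped))
print(filled["salary_in_usd"].tolist())
```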


In [29]:
# Data normalization using z-score
df['salary_with_zscore'] = (df['salary'] - df['salary'].mean()) / df['salary'].std()
df.head()

In [34]:
# Binning data into categories
bins = np.linspace(min(df['salary_with_zscore']), max(df['salary_with_zscore']), 4)
group_names = ['Low', 'Medium', 'High']
df['binned'] = pd.cut(df['salary_with_zscore'], bins, labels=group_names, include_lowest=True)

df.head()

,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size,company_loc,salary_with_zscore,binned
0,2023,SE,FT,Principal Data Scientist,-0.164805,EUR,85847,ES,100,ES,L,ES,-0.164805,Low
1,2023,MI,CT,ML Engineer,-0.239245,USD,30000,US,100,US,S,US,-0.239245,Low
2,2023,MI,CT,ML Engineer,-0.245945,USD,25500,US,100,US,S,US,-0.245945,Low
3,2023,SE,FT,Data Scientist,-0.023368,USD,175000,CA,100,CA,M,CA,-0.023368,Low
4,2023,SE,FT,Data Scientist,-0.105252,USD,120000,CA,100,CA,M,CA,-0.105252,Low
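
To see the z-score and binning steps on numbers that are easy to check by hand, here is a small sketch with five made-up salaries. It assumes nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical, evenly spaced salaries.
salaries = pd.Series([30000, 60000, 90000, 120000, 150000], name="salary")

# z-score: subtract the mean, divide by the standard deviation.
# pandas computes the sample standard deviation (ddof=1) by default.
z = (salaries - salaries.mean()) / salaries.std()

# Three equal-width bins between the smallest and largest z-score.
bins = np.linspace(z.min(), z.max(), 4)
labels = ["Low", "Medium", "High"]
binned = pd.cut(z, bins, labels=labels, include_lowest=True)

print(z.round(3).tolist())
print(binned.tolist())
```

Equal-width bins are sensitive to outliers: with a heavily skewed column, almost every row falls into the first bin, as in the output above. Quantile-based binning with pd.qcut is one alternative when roughly equal group sizes are wanted.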


In [37]:
# Creating dummy variables
pd.get_dummies(df['work_year'])


,2020,2021,2022,2023
0,0,0,0,1
1,0,0,0,1
2,0,0,0,1
3,0,0,0,1
4,0,0,0,1
...,...,...,...,...
3750,1,0,0,0
3751,0,1,0,0
3752,1,0,0,0
3753,1,0,0,0
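
The dummy columns are usually joined back to the original frame so they can be used as model inputs. A minimal sketch on a made-up column follows; the prefix year is an arbitrary choice, and dtype=int requests 0/1 integers (recent pandas versions return True/False by default).

```python
import pandas as pd

# Hypothetical toy column of years.
toy = pd.DataFrame({"work_year": [2023, 2020, 2021, 2023]})

# One indicator column per distinct year, named year_<value>.
dummies = pd.get_dummies(toy["work_year"], prefix="year", dtype=int)

# Attach the indicator columns to the original frame.
toy_encoded = pd.concat([toy, dummies], axis=1)
print(toy_encoded)
```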


5. Exploratory Data Analysis (EDA)
EDA involves visualizing and analyzing the data to uncover patterns, relationships, and insights.

In [39]:
# Count of unique values in a column
df['salary_in_usd'].value_counts().to_frame()



,salary_in_usd
100000,99
150000,98
120000,91
160000,84
130000,82
...,...
234100,1
223800,1
172100,1
232200,1


In [42]:
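# Grouping data (numeric_only=True restricts the mean to numeric columns;
# pandas 2.x raises a TypeError on the text columns without it)
df.groupby('binned').mean(numeric_only=True)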


binned,work_year,salary,salary_in_usd,remote_ratio,salary_with_zscore
Low,2022.375,-0.020566,137650.528252,46.25533,-0.020566
Medium,2020.5,16.093022,35997.0,50.0,16.093022
High,2021.0,44.975973,40038.0,100.0,44.975973


In [47]:
# Correlation analysis
from scipy import stats
pearson_coef, p_value = stats.pearsonr(df['salary_in_usd'], df['remote_ratio'])
pearson_coef, p_value


(-0.06417098519057554, 8.318979595835616e-05)
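
The first number is the Pearson correlation coefficient and the second is the p-value. Here the coefficient of about -0.064 indicates a very weak negative linear relationship, even though the small p-value says it is unlikely to be exactly zero given 3755 rows. To make the coefficient itself less of a black box, the sketch below computes it from its definition on made-up numbers and compares it with SciPy.

```python
import numpy as np
from scipy import stats

# Hypothetical paired observations.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Pearson r: covariance of x and y divided by the product of their spreads.
dx = x - x.mean()
dy = y - y.mean()
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

r_scipy, p_value = stats.pearsonr(x, y)
print(r_manual, r_scipy, p_value)
```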

In [51]:
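# Chi-square test of independence between two categorical columns
# (chi2_contingency needs a contingency table, not the whole DataFrame;
# the column pair below is an illustrative choice)
from scipy.stats import chi2_contingency
contingency = pd.crosstab(df['experience_level'], df['company_size'])
chi2, p, dof, expected = chi2_contingency(contingency)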

6. Linear Regression
Performing linear regression involves fitting a linear model to the data and making predictions.

In [54]:
import numpy as np
from sklearn.linear_model import LinearRegression

# scikit-learn expects a 2D feature array, so reshape the column to shape (n_samples, 1)
x = df['salary_in_usd'].values.reshape(-1, 1)
y = df['remote_ratio']

# Creating the linear regression model
lm = LinearRegression()

# Fitting the model
lm.fit(x, y)


LinearRegression()

In [58]:
# Making predictions
y_pred = lm.predict(x)
y_pred

array([48.82928281, 51.5908342 , 51.81335254, ..., 47.88219534,
       48.12943793, 48.39324577])

In [65]:
# Intercept and coefficients
intercept = lm.intercept_
coefficients = lm.coef_
xv= ['intercept',intercept,'coefficients',coefficients]
xv

['intercept', 53.074289752719324, 'coefficients', array([-4.94485183e-05])]
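
The fitted model is the line y = intercept + coefficient * x. With the values above, the predicted remote ratio drops by roughly 4.9 points for every additional 100,000 USD of salary, which is a very shallow slope. The sketch below, on made-up points that lie exactly on y = 2x + 1, shows that predict is the same as applying this formula by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical points on the exact line y = 2x + 1.
x_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = 2.0 * x_toy.ravel() + 1.0

lm_toy = LinearRegression().fit(x_toy, y_toy)

# Reproduce predict() from the intercept and the single coefficient.
manual = lm_toy.intercept_ + lm_toy.coef_[0] * x_toy.ravel()

print(lm_toy.intercept_, lm_toy.coef_)
print(np.allclose(manual, lm_toy.predict(x_toy)))
```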

7. Multiple Linear Regression
Multiple linear regression is used when you have multiple predictors.

In [66]:
df.head()

,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size,company_loc,salary_with_zscore,binned
0,2023,SE,FT,Principal Data Scientist,-0.164805,EUR,85847,ES,100,ES,L,ES,-0.164805,Low
1,2023,MI,CT,ML Engineer,-0.239245,USD,30000,US,100,US,S,US,-0.239245,Low
2,2023,MI,CT,ML Engineer,-0.245945,USD,25500,US,100,US,S,US,-0.245945,Low
3,2023,SE,FT,Data Scientist,-0.023368,USD,175000,CA,100,CA,M,CA,-0.023368,Low
4,2023,SE,FT,Data Scientist,-0.105252,USD,120000,CA,100,CA,M,CA,-0.105252,Low


In [70]:
# Selecting the columns for the model
# experience_level and company_size are text columns, so one-hot encode them first
X = pd.get_dummies(df[['experience_level', 'remote_ratio', 'company_size']], drop_first=True)
y = df['salary_in_usd']
# Creating the linear regression model
lm = LinearRegression()

# Fitting the model on the encoded predictors X (not the single-column x from section 6)
lm.fit(X, y)


LinearRegression()

In [73]:
# Intercept and coefficients (one coefficient per column of X)
intercept = lm.intercept_
coefficients = lm.coef_
xnv= ['intercept',intercept,'coefficients',coefficients]
xnv

8. Evaluating Model Accuracy
To evaluate the accuracy of a regression model on data it has not seen, you can use techniques such as a train-test split and cross-validation.


In [78]:
# Importing train_test_split and cross_val_score from sklearn
from sklearn.model_selection import train_test_split, cross_val_score
x = df['salary_in_usd'].values.reshape(-1, 1)
y = df['remote_ratio']
# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

# Cross-validation scores
scores = cross_val_score(lm, x, y, cv=10)
scores

array([-0.04649226, -0.07321749, -0.13745116, -0.0169858 , -0.06629484,
       -0.03666228, -0.01681506, -0.3675893 , -0.13809427, -0.64545954])
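
These scores are R² values, the default metric for a regressor. A negative R² means the model predicts worse than simply using the mean of the target, so salary_in_usd is not a useful predictor of remote_ratio here. For comparison, the sketch below runs the same two evaluation steps on synthetic data with a real linear relationship; the data, the noise level, and the variable names are all made up for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data: y depends linearly on X, plus a little noise.
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(200, 1))
y_syn = 3.0 * X_syn.ravel() + 5.0 + rng.normal(0, 1.0, size=200)

# Hold out 30% of the rows, fit on the rest, score on the held-out part.
X_train, X_test, y_train, y_test = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0
)
model = LinearRegression().fit(X_train, y_train)
test_r2 = model.score(X_test, y_test)

# 10-fold cross-validation refits a fresh copy of the model for each fold.
cv_scores = cross_val_score(LinearRegression(), X_syn, y_syn, cv=10)

print(test_r2, cv_scores.mean())
```

The key point is that the score comes from rows the model was not fitted on, which is what the unused X_test and y_test in the cell above are meant for.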

9. Ridge Regression
Ridge regression adds an L2 penalty on the coefficients to linear regression; this regularization can help prevent overfitting.

In [81]:
# Importing Ridge from sklearn
from sklearn.linear_model import Ridge

# Creating the Ridge regression model
ridge_model = Ridge(alpha=0.1)

# Fitting the model
ridge_model.fit(x, y)

# Making predictions
y_pred = ridge_model.predict(x)
y_pred

array([48.82928281, 51.5908342 , 51.81335254, ..., 47.88219534,
       48.12943793, 48.39324577])
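
At the printed precision these predictions match the plain linear regression from section 6, which suggests that alpha=0.1 is too small to matter for an unscaled feature measured in dollars. The effect of alpha is easier to see on small, standardized inputs. The sketch below uses synthetic data and arbitrary alpha values to show the coefficients shrinking as the penalty grows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data with three standardized-scale predictors.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(50, 3))
y_syn = X_syn @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, size=50)

# Size of the coefficient vector without any penalty.
ols_norm = np.linalg.norm(LinearRegression().fit(X_syn, y_syn).coef_)

# Size of the coefficient vector for increasing penalty strengths.
alphas = [0.1, 10.0, 1000.0]
ridge_norms = [
    np.linalg.norm(Ridge(alpha=a).fit(X_syn, y_syn).coef_) for a in alphas
]

print(ols_norm)
for a, n in zip(alphas, ridge_norms):
    print(a, n)
```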

10. Grid Search
Grid search is used to find the best hyperparameters for a given model.

In [82]:
df.head()

,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size,company_loc,salary_with_zscore,binned
0,2023,SE,FT,Principal Data Scientist,-0.164805,EUR,85847,ES,100,ES,L,ES,-0.164805,Low
1,2023,MI,CT,ML Engineer,-0.239245,USD,30000,US,100,US,S,US,-0.239245,Low
2,2023,MI,CT,ML Engineer,-0.245945,USD,25500,US,100,US,S,US,-0.245945,Low
3,2023,SE,FT,Data Scientist,-0.023368,USD,175000,CA,100,CA,M,CA,-0.023368,Low
4,2023,SE,FT,Data Scientist,-0.105252,USD,120000,CA,100,CA,M,CA,-0.105252,Low
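
As an illustration of the idea, the sketch below uses scikit-learn's GridSearchCV to choose the Ridge alpha by cross-validation. It runs on synthetic data, and the candidate alpha values are arbitrary; treat it as a starting point rather than a tuned setup for the salary dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data with three predictors.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(100, 3))
y_syn = X_syn @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.5, size=100)

# Candidate values for the hyperparameter being tuned.
param_grid = {"alpha": [0.001, 0.1, 1.0, 10.0, 100.0]}

# Fit one model per candidate and fold, scoring each with R².
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="r2")
search.fit(X_syn, y_syn)

print(search.best_params_, search.best_score_)
```

After fitting, search.best_estimator_ holds a model refitted on all the data with the best alpha, and search.cv_results_ has the score for every candidate.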
