<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Finding Correlation**


Estimated time needed: **30** minutes


In this lab, you will work with a cleaned dataset to perform exploratory data analysis (EDA). You will examine the distribution of the data, identify outliers, and determine the correlation between different columns in the dataset.


## Objectives


In this lab, you will perform the following:


- Identify the distribution of compensation data in the dataset.

- Remove outliers to refine the dataset.

- Identify correlations between various features in the dataset.


## Hands-on Lab


### Step 1: Install and Import Required Libraries


In [None]:
# Install the necessary libraries
!pip install pandas
!pip install matplotlib
!pip install seaborn

# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns




### Step 2: Load the Dataset


In [None]:
# Load the dataset from the given URL
file_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv"
df = pd.read_csv(file_url)

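# Display the first few rows to understand the structure of the dataset
# (print is needed because only the last expression in a cell is shown automatically)
print(df.head())

# List the column names
print(df.columns.tolist())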

<h3>Step 3: Analyze and Visualize Compensation Distribution</h3>


**Task**: Plot the distribution and histogram for `ConvertedCompYearly` to examine the spread of yearly compensation among respondents.


In [None]:
## Write your code here
# sns.distplot is deprecated; sns.kdeplot draws the same density curve
plt.figure(figsize=(10,5))
sns.kdeplot(data=df, x="ConvertedCompYearly")
plt.show()

In [None]:
# Drop missing values before binning so that NaN entries cannot affect the histogram
plt.hist(df['ConvertedCompYearly'].dropna())
plt.show()

In [None]:
sns.histplot(df['ConvertedCompYearly'])
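
Compensation data is often strongly right-skewed, which makes a plain histogram hard to read. The sketch below quantifies that skew and shows how a log transform changes it. It uses synthetic log-normal values as a stand-in for `ConvertedCompYearly` (an assumption made for illustration, not the survey data), and the names `comp`, `raw_skew`, and `log_skew` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in for ConvertedCompYearly (not the survey data)
rng = np.random.default_rng(seed=0)
comp = pd.Series(rng.lognormal(mean=11, sigma=1, size=5_000), name="ConvertedCompYearly")

# Skewness well above 0 indicates a long right tail
raw_skew = comp.skew()
# After a log transform the values are spread far more symmetrically
log_skew = np.log10(comp).skew()

print(f"skewness (raw):   {raw_skew:.2f}")
print(f"skewness (log10): {log_skew:.2f}")

# Histogram counts over 20 equal-width bins in log10 space
counts, edges = np.histogram(np.log10(comp), bins=20)
print(counts)
```

If the real column behaves similarly, plotting it on a log scale (for example with `sns.histplot(..., log_scale=True)`) may give a more readable picture than the raw values.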

<h3>Step 4: Calculate Median Compensation for Full-Time Employees</h3>


**Task**: Filter the data to calculate the median compensation for respondents whose employment status is "Employed, full-time."


In [None]:
# Convert column names to a list
columns_list = df.columns.tolist()

# Print the list of column names
print(columns_list)

In [None]:
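## Write your code here
# Keep respondents whose Employment value mentions full-time employment.
# Note: str.contains also matches entries that list additional statuses.
full_time_employees = df[df['Employment'].str.contains('Employed, full-time', case=False, na=False, regex=False)]

# Drop rows that have no compensation value
full_time_employees_cleaned = full_time_employees.dropna(subset=['ConvertedCompYearly'])

# Median yearly compensation of full-time employees
median_values = full_time_employees_cleaned['ConvertedCompYearly'].median()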

print(median_values)
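
The choice between a substring match and an exact match can change the result of this step. The sketch below compares the two on a tiny made-up sample. It assumes that `Employment` may hold several semicolon-separated statuses in one cell; that format and all values shown here are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd

# Tiny made-up sample; the semicolon-separated multi-select format is an assumption
sample = pd.DataFrame({
    "Employment": [
        "Employed, full-time",
        "Employed, full-time;Independent contractor, freelancer, or self-employed",
        "Student, full-time",
        "Employed, full-time",
        None,
    ],
    "ConvertedCompYearly": [60000.0, 90000.0, 5000.0, 80000.0, 40000.0],
})

# Substring match: keeps any row that mentions the status
contains_mask = sample["Employment"].str.contains(
    "Employed, full-time", case=False, na=False, regex=False
)
# Exact match: keeps only rows whose sole status is full-time employment
exact_mask = sample["Employment"].eq("Employed, full-time")

median_contains = sample.loc[contains_mask, "ConvertedCompYearly"].median()
median_exact = sample.loc[exact_mask, "ConvertedCompYearly"].median()

print("Median (substring match):", median_contains)
print("Median (exact match):    ", median_exact)
```

Which filter is appropriate depends on whether respondents with additional statuses should count as full-time employees.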

<h3>Step 5: Analyzing Compensation Range and Distribution by Country</h3>


Explore the range of compensation in the `ConvertedCompYearly` column across countries. Use box plots to compare the compensation distribution of each country and to spot variations and anomalies within each region.



In [None]:
## Write your code here
df_clean = df.dropna(subset=['ConvertedCompYearly', 'Country'])

# Create a box plot to compare compensation across countries
plt.figure(figsize=(25, 10))

# Plot a box plot for 'ConvertedCompYearly' by 'Country'
sns.boxplot(x='Country', y='ConvertedCompYearly', data=df_clean)

# Rotate x-axis labels for readability (if there are many countries)
plt.xticks(rotation=90)

# Set plot title and labels
plt.title('Distribution of Yearly Compensation by Country', fontsize=16)
plt.xlabel('Country', fontsize=14)
plt.ylabel('Yearly Compensation (USD)', fontsize=14)
plt.yscale('log')

# Show the plot
plt.tight_layout()  # Ensures the labels fit within the plot area
plt.show()

In [None]:
# Get the top 2 countries with the most respondents
top_countries = df_clean['Country'].value_counts().head(2).index
df_top_countries = df_clean[df_clean['Country'].isin(top_countries)]

# Create the box plot for these top 2 countries
plt.figure(figsize=(15, 10))
sns.boxplot(x='Country', y='ConvertedCompYearly', data=df_top_countries)
plt.xticks(rotation=90)
plt.title('Top 2 Countries by Number of Respondents - Compensation Distribution')
plt.yscale('log')
plt.tight_layout()
plt.show()

# These are the two countries with the most respondents.
# Yearly compensation tends to be higher in the USA than in Germany, and the USA has more outliers than Germany.
# Both boxes look roughly symmetric on the log-scaled axis.
# In both countries the outliers lie beyond the whiskers.
# The median line of the USA box lies outside the Germany box.
#USA has a longer box which means that the data is more dispersed.
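
A box plot can be checked against the numbers it is drawn from. The sketch below computes per-country quartiles and the interquartile range with `groupby`, using made-up values for two hypothetical countries `A` and `B` (none of these values come from the survey).

```python
import pandas as pd

# Made-up values for two hypothetical countries, only to illustrate the calculation
sample = pd.DataFrame({
    "Country": ["A"] * 5 + ["B"] * 5,
    "ConvertedCompYearly": [10, 20, 30, 40, 50, 100, 200, 300, 400, 1000],
})

# Quartiles per country: the values a box plot draws as the box edges and median line
summary = (
    sample.groupby("Country")["ConvertedCompYearly"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
summary.columns = ["Q1", "median", "Q3"]

# A taller box corresponds to a larger interquartile range
summary["IQR"] = summary["Q3"] - summary["Q1"]
print(summary)
```

Applied to `df_top_countries`, a table like this would let you verify statements about medians and box heights numerically rather than by eye.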

<h3>Step 6: Removing Outliers from the Dataset</h3>


**Task**: Create a new DataFrame by removing outliers from the `ConvertedCompYearly` column to get a refined dataset for correlation analysis.


In [None]:
## Write your code here

#Find the Inter Quartile Range First!
Q1 = df_clean['ConvertedCompYearly'].quantile(0.25)
Q3 = df_clean['ConvertedCompYearly'].quantile(0.75)
IQR = Q3 - Q1
print("IQR: ",IQR)

#Find out the upper and lower bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print("Lower Bound:", lower_bound)
print("Upper Bound:", upper_bound)

# Flag values outside the bounds as outliers
outliers = (df_clean['ConvertedCompYearly'] < lower_bound) | (df_clean['ConvertedCompYearly'] > upper_bound)
num_outliers = outliers.sum()
# Report the shape of df_clean, since the outliers are counted and removed from that DataFrame
print("With Outliers : ", df_clean.shape)
print("How many outliers there are : ", num_outliers)

# Create a new dataframe with the removed outliers (using AND instead of OR)
outliers_removed_df = df_clean[(df_clean['ConvertedCompYearly'] >= lower_bound) & (df_clean['ConvertedCompYearly'] <= upper_bound)]
print("Filtered without Outliers: ", outliers_removed_df.shape)
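
The IQR rule used in this step can be wrapped in a small helper so that it is easy to reuse on other columns. This is a minimal sketch on a made-up series; the function name `iqr_bounds` and the multiplier parameter `k` are hypothetical names introduced here.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) bounds of the IQR rule for a numeric series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up values with one obvious outlier
sample = pd.Series([30, 40, 50, 60, 70, 80, 90, 1000], name="ConvertedCompYearly")

lower, upper = iqr_bounds(sample)
# between() is inclusive on both ends, matching the >= and <= filter used above
kept = sample[sample.between(lower, upper)]

print("Bounds:", lower, upper)
print("Rows kept:", len(kept), "of", len(sample))
```

Note that the bounds are computed over the whole column. When groups differ a lot (for example, countries), computing bounds per group may be worth considering, but that is a separate design choice.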

<h3>Step 7: Finding Correlations Between Key Variables</h3>


**Task**: Calculate correlations between `ConvertedCompYearly`, `WorkExp`, and `JobSatPoints_1`. Visualize these correlations with a heatmap.


In [None]:
## Write your code here
df_correlations = outliers_removed_df[['ConvertedCompYearly','WorkExp','JobSatPoints_1']]
corr_matrix = df_correlations.corr()
# Print the matrix explicitly; a bare expression in the middle of a cell is not displayed
print(corr_matrix)
# Create a heatmap to visualize the correlations
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0)
plt.show()


# df_clean = df.dropna(subset=['ConvertedCompYearly', 'WorkExp','JobSatPoints_1'])
# df_new = df_clean[['ConvertedCompYearly','WorkExp','JobSatPoints_1']]
# df_new
# corr_matrix2 = df_new.corr()
# corr_matrix2
# plt.figure(figsize=(8, 6))
# sns.heatmap(corr_matrix2, annot=True, cmap='coolwarm', center=0)
# plt.show()
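
`DataFrame.corr()` computes Pearson correlation by default, which measures linear association. When a relationship is monotonic but not linear, the Spearman rank correlation can be a useful second view. The sketch below compares both on synthetic data; the exponential relationship between `WorkExp` and `ConvertedCompYearly` is an assumption chosen for illustration and says nothing about the survey.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

# Hypothetical data: compensation grows non-linearly with experience, plus noise
work_exp = rng.uniform(0, 30, size=500)
comp = 40_000 * np.exp(0.05 * work_exp) * rng.lognormal(0, 0.3, size=500)
sample = pd.DataFrame({"WorkExp": work_exp, "ConvertedCompYearly": comp})

# Pearson: strength of the linear relationship
pearson = sample.corr(method="pearson")
# Spearman: strength of the monotonic relationship (computed on ranks)
spearman = sample.corr(method="spearman")

print("Pearson:\n", pearson.round(2))
print("Spearman:\n", spearman.round(2))
```

Either matrix can be passed to `sns.heatmap` in the same way as `corr_matrix`.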

<h3>Step 8: Scatter Plot for Correlations</h3>


**Task**: Create scatter plots to examine specific correlations between `ConvertedCompYearly` and `WorkExp`, as well as between `ConvertedCompYearly` and `JobSatPoints_1`.


In [None]:
## Write your code here
plt.figure(figsize=(15, 6))
y= df['ConvertedCompYearly']
x= df['WorkExp']
plt.scatter(x,y)
plt.title('Work Experience and Converted Comp Yearly')
plt.xlabel('Work Experience')
plt.ylabel('Converted Comp Yearly')

plt.show()



In [None]:
## Write your code here
plt.figure(figsize=(15, 6))
y= df['ConvertedCompYearly']
x= df['JobSatPoints_1']
plt.scatter(x,y)
plt.title('Job Satisfaction Points and Converted Comp Yearly')
plt.xlabel('Job Satisfaction Points')
plt.ylabel('Converted Comp Yearly')

plt.show()

In [None]:
## Write your code here
plt.figure(figsize=(15, 6))

y= df['ConvertedCompYearly']
x= df['Age']
plt.scatter(x,y)
plt.title('Age and Converted Comp Yearly')
plt.xlabel('Age')
plt.ylabel('Converted Comp Yearly')

plt.show()
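
A scatter plot is easier to relate to a correlation coefficient when a fitted trend line accompanies it. The sketch below fits a straight line with `np.polyfit` and checks the link between its slope and the Pearson coefficient. The data is synthetic and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical data: years of experience and a noisy linear compensation
x = rng.uniform(0, 30, size=300)
y = 50_000 + 2_000 * x + rng.normal(0, 15_000, size=300)

# Least-squares straight line: y is approximately slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)

# Pearson correlation coefficient between x and y
r = np.corrcoef(x, y)[0, 1]

# For a simple linear fit, the slope equals r * std(y) / std(x)
slope_from_r = r * y.std() / x.std()

print(f"slope: {slope:.1f}, intercept: {intercept:.1f}, r: {r:.2f}")
print(f"slope from r: {slope_from_r:.1f}")
```

To overlay the line on a scatter plot, one option is `plt.plot(np.sort(x), slope * np.sort(x) + intercept)` after the `plt.scatter` call.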

In [None]:
# Create the box plot for the top 10 countries (outliers removed)
top_countries = outliers_removed_df['Country'].value_counts().head(10).index
df_top_countries = outliers_removed_df[outliers_removed_df['Country'].isin(top_countries)]

plt.figure(figsize=(15, 10))
sns.boxplot(x='Country', y='ConvertedCompYearly', data=df_top_countries)
plt.xticks(rotation=90)
plt.title('Top 10 Countries by Number of Respondents - Compensation Distribution (Outliers Removed)')
plt.yscale('log')
plt.tight_layout()
plt.show()
print(outliers_removed_df.shape)

In [None]:
# Get the top 10 countries with the most respondents
top_countries = df['Country'].value_counts().head(10).index
df_top_countries = df[df['Country'].isin(top_countries)]

# Create the box plot for these top 10 countries
plt.figure(figsize=(15, 10))
sns.boxplot(x='Country', y='ConvertedCompYearly', data=df_top_countries)
plt.xticks(rotation=90)
plt.title('Top 10 Countries by Number of Respondents - Compensation Distribution')
plt.yscale('log')
plt.tight_layout()
plt.show()


<h3>Summary</h3>


In this lab, you practiced essential skills in correlation analysis by:

- Examining the distribution of yearly compensation with histograms and box plots.
- Detecting and removing outliers from compensation data.
- Calculating correlations between key variables such as compensation, work experience, and job satisfaction.
- Visualizing relationships with scatter plots and heatmaps to gain insights into the associations between these features.

By following these steps, you have developed a solid foundation for analyzing relationships within the dataset.


## Authors:
Ayushi Jain


### Other Contributors:
- Rav Ahuja
- Lakshmi Holla
- Malika


Copyright © IBM Corporation. All rights reserved.
