# Unemployment analysis with Python


## 1. Problem Statement
Unemployment is measured by the unemployment rate: the number of unemployed people as a percentage of the total labour force. The unemployment rate rose sharply during Covid-19, so this analysis aims to shed light on the socio-economic consequences of the pandemic for India's workforce and labour market.

The dataset helps to understand unemployment dynamics across India's states during the Covid-19 crisis. It shows how the unemployment rate, employment figures, and labour participation rates were affected in different regions of the country.
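
The definition of the unemployment rate given above can be written as a small function. This is a minimal sketch; the function name `unemployment_rate` and the figures are hypothetical and are not taken from the dataset.

```python
def unemployment_rate(unemployed: float, labour_force: float) -> float:
    """Return the unemployed as a percentage of the total labour force."""
    if labour_force <= 0:
        raise ValueError("labour_force must be positive")
    return 100.0 * unemployed / labour_force


# Hypothetical figures: 5 unemployed people in a labour force of 50
rate = unemployment_rate(5, 50)
print(rate)  # 10.0
```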

## 2. Preparing the environment


In [None]:
#Loading libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')
%matplotlib inline

## 3. Data collecting
Loading the dataset into a pandas DataFrame to manipulate and analyse it.

In [None]:
unemp = pd.read_csv("/kaggle/input/unemployment-in-india/Unemployment in India.csv")

## 4. Data Exploring
Using exploratory data analysis (EDA), we get familiar with the dataset by looking at its basic information and some aggregations.

In [None]:
#Whole dataset information
unemp.info()

In [None]:
#Printing random 5 rows as an overview
unemp.sample(5)

In [None]:
# How many null values in each feature?
unemp.isna().sum()

In [None]:
#Generating some descriptive statistics for the numeric features
unemp.describe()

In [None]:
unemp['Area'].value_counts()

In [None]:
#Observations in different regions aren't with the same frequency
unemp['Region'].value_counts()

In [None]:
unemp[' Date'].value_counts()

In [None]:
#Important note: there's a typo in frequency values
unemp[' Frequency'].value_counts()
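
Several column names used above, such as `' Date'` and `' Frequency'`, start with a space, and a similar stray space in the values is one common way for a label to appear twice in `value_counts()`. The sketch below shows how whitespace can be stripped from both names and values. It runs on a toy frame with hypothetical labels, since the exact typo in the real data is not shown here, and the rest of the notebook keeps the original column names until the renaming step.

```python
import pandas as pd

# Toy frame mimicking the leading spaces seen in names such as ' Frequency'
toy = pd.DataFrame({
    "Region": ["A", "B", "C"],
    " Frequency": [" Monthly", "Monthly", "Monthly"],  # hypothetical inconsistent labels
})

# Strip surrounding whitespace from the column names
toy.columns = toy.columns.str.strip()

# Strip surrounding whitespace from the values so equal labels compare equal
toy["Frequency"] = toy["Frequency"].str.strip()

freq_counts = toy["Frequency"].value_counts()
print(freq_counts)
```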

In [None]:
#Unemployment rate and people employed are negatively correlated (weak correlation)
unemp[' Estimated Employed'].corr(unemp[' Estimated Unemployment Rate (%)'])
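
`Series.corr` computes the Pearson correlation coefficient by default. As a rough illustration of what that number means, the sketch below recomputes it by hand on made-up values (not from the dataset) and compares it with the pandas result.

```python
import numpy as np
import pandas as pd

# Made-up values with a negative relationship
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([10.0, 8.0, 9.0, 5.0, 4.0])

# Pearson correlation as computed by pandas
r_pandas = x.corr(y)

# The same coefficient from its definition:
# sum of products of deviations divided by the product of their norms
x_dev = x - x.mean()
y_dev = y - y.mean()
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

print(r_pandas, r_manual)
```

A value close to 0 indicates a weak linear relationship, and the sign gives its direction.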

### Exploring the data showed that it needs some manipulation and cleaning before analysis.

## 5. Data Cleaning
Transforming the raw data into a usable format and making sure it is accurate.

In [None]:
# Dropping rows with null values, as they appear to be missing completely at random and don't depend on other variables
# All data is recorded monthly, so the frequency column is redundant and is dropped as well
df2 = unemp.dropna().drop(columns=[' Frequency'])

df2
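
One way to sanity-check the decision to drop null values is to compare the number of rows with any missing value against the number of rows that are entirely empty. If the two match, `dropna()` only removes blank rows and no partial records are lost. The sketch below illustrates the check on a toy frame; whether the real dataset behaves this way is an assumption to verify on `unemp`.

```python
import numpy as np
import pandas as pd

# Toy frame with two completely empty rows (hypothetical data)
toy = pd.DataFrame({
    "region": ["A", "B", np.nan, "C", np.nan],
    "rate": [5.1, 7.3, np.nan, 4.0, np.nan],
})

rows_any_missing = toy.isna().any(axis=1).sum()  # rows with at least one null
rows_all_missing = toy.isna().all(axis=1).sum()  # rows where every value is null

# Equal counts mean every incomplete row is a fully blank row
print(rows_any_missing, rows_all_missing)

cleaned = toy.dropna()
```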

### Renaming Details:

The dataset provides insights into the unemployment scenario across different Indian states:

- **region**: The states within India.
- **date**: The date when the unemployment rate was recorded.
- **est_unemp_perc**: The percentage of individuals unemployed in each state of India.
- **est_mil_emp**: The number of people currently employed, in millions.
- **est_labour_perc**: The proportion of the working population (age group: 16-64 years) participating in the labour force, either employed or actively seeking employment.

In [None]:
# 1. Renaming columns for easier access
# 2. Resetting the index
# 3. Converting the estimated employed column to millions and rounding it for clearer visualization
df3 = df2.rename(columns={"Region" : "region", ' Date' : 'date', ' Estimated Unemployment Rate (%)' : 'est_unemp_perc', ' Estimated Employed' : 'est_mil_emp',
                          ' Estimated Labour Participation Rate (%)' : 'est_labour_perc', 'Area' : 'area'}).reset_index(drop = True)
df3['est_mil_emp'] = (df3['est_mil_emp'] / 1000000).round(2)
df3
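
The `date` column is still stored as text after renaming, so sorting or plotting it follows string order rather than calendar order. A possible extra cleaning step is to parse it into datetimes. The sketch below assumes day-month-year strings with a leading space, such as `' 31-05-2019'`; that format is an assumption and should be checked against `df3['date'].head()` before applying it to the real data.

```python
import pandas as pd

# Toy column in the assumed ' DD-MM-YYYY' format (hypothetical values)
toy = pd.DataFrame({"date": [" 31-05-2019", " 30-04-2020", " 31-01-2020"]})

# Strip the stray whitespace, then parse with an explicit format
toy["date"] = pd.to_datetime(toy["date"].str.strip(), format="%d-%m-%Y")

# With real datetimes, sorting follows the calendar
toy = toy.sort_values("date").reset_index(drop=True)
print(toy)
```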

In [None]:
# Ensuring that there aren't any duplicates
df3.duplicated().sum()

In [None]:
#Searching for outliers
sns.boxplot(data= df3, x='region', y='est_unemp_perc')
plt.title('Detecting outliers in unemployment rates per region')
plt.ylabel('Unemployment percentage')
plt.xticks(rotation = 90)
plt.show()

There are many outliers, but we can't remove them because they still represent valid information.
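
The points drawn beyond the whiskers of a box plot follow, by default, the 1.5 × IQR rule. The sketch below applies the same rule per region to count outliers numerically instead of reading them off the chart. It uses a toy frame with made-up values, and quantile interpolation may differ slightly from what the plotting library uses.

```python
import pandas as pd

# Toy data: region A contains one extreme value (hypothetical numbers)
toy = pd.DataFrame({
    "region": ["A"] * 5 + ["B"] * 5,
    "est_unemp_perc": [4.0, 5.0, 6.0, 5.0, 30.0, 8.0, 9.0, 10.0, 9.0, 11.0],
})

grouped = toy.groupby("region")["est_unemp_perc"]

# Per-region quartiles, broadcast back to the original rows
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1

# Flag values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
toy["is_outlier"] = (toy["est_unemp_perc"] < q1 - 1.5 * iqr) | (
    toy["est_unemp_perc"] > q3 + 1.5 * iqr
)

outliers_per_region = toy.groupby("region")["is_outlier"].sum()
print(outliers_per_region)
```

Flagging the rows rather than dropping them keeps the valid but extreme observations available for the later analysis.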

### The dataset is clean now. Next we will visualize it to gather some insights

## 6. Data visualization
This is the stage where we will analyze the data and get some answers. 

Let's look at how the labour participation rate changes per month and see if we can get further insights.

In [None]:
sns.lineplot(data=df3, x='date', y='est_labour_perc', hue='area', errorbar= None)
plt.title('Labour participation rate (%) per month')
plt.ylabel('Labour participation rate (%)')
plt.xticks(rotation= 70)
plt.figtext(x= 0.48, y= 0.2 , s= 'The participation rate\nof both areas reached\nits lowest in April 2020,\nat the peak\nof the crisis')
plt.show()
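
The month with the lowest participation rate can also be checked numerically rather than read from the chart. The sketch below averages the rate per area and month and then picks the minimum for each area. The values are hypothetical, and it assumes the `date` column has already been parsed into datetimes.

```python
import pandas as pd

# Toy data with made-up participation rates for three months
toy = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-03-31", "2020-03-31",
        "2020-04-30", "2020-04-30",
        "2020-05-31", "2020-05-31",
    ]),
    "area": ["Rural", "Urban"] * 3,
    "est_labour_perc": [44.0, 40.0, 36.0, 33.0, 41.0, 37.0],
})

# Mean participation rate per month, one column per area
monthly = toy.groupby(["area", "date"])["est_labour_perc"].mean().unstack("area")

# Date of the lowest monthly mean for each area
lowest_month = monthly.idxmin()
print(lowest_month)
```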

Comparing the average unemployment percentage across regions should show which region was most affected by the Covid-19 crisis.

In [None]:
df5 = df3.groupby('region')[['est_unemp_perc']].mean().sort_values(by='est_unemp_perc',ascending= False)
df5.plot(kind='bar')
plt.title('Estimated unemployment percentage per region')
plt.ylabel('Unemployment percentage')
plt.xticks(rotation= 90)
plt.figtext(x= 0.19, y= 0.7, s= 'Tripura has the highest\nunemployment percentage\nof all regions')
plt.show()

The average number of people employed per region will tell us whether there is a significant difference in employment among states, as well as which regions have the highest and lowest numbers of employees.

In [None]:
df4 = df3.groupby('region')[['est_mil_emp']].mean().sort_values(by='est_mil_emp',ascending= False)
df4.plot(kind='bar')
plt.title('Average millions employed per region')
plt.ylabel('Millions employed')
plt.figtext(x=0.15, y=0.75, s='While the numbers of employees have\na big variance among regions,\nUttar Pradesh has the most employees')
plt.show()


Which area has a lower unemployment rate?

In [None]:
sns.barplot(data=df3, x='area', y='est_unemp_perc', errorbar=None)
plt.figtext(x= 0.15, y= 0.78 , s= 'Rural areas have a lower\nunemployment percentage\nthan urban areas.')
plt.title("Average unemployment percentage per area")
plt.ylabel('Unemployment percentage')
plt.show()


Which state was most affected by the crisis?

In [None]:
# Range (maximum - minimum) of the unemployment rate per region, largest first
df3.groupby('region')['est_unemp_perc'].agg(lambda x: x.max() - x.min()).sort_values(ascending=False).plot(kind='bar')
plt.suptitle('The difference in unemployment rate per region')
plt.title('Maximum rate - minimum rate')
plt.figtext(x= 0.15, y= 0.75, s='Puducherry is the most affected\nby the crisis, with a difference of more\nthan 70 percentage points')
plt.show()
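
A range on its own hides which extremes produced it. As a complement to the chart, the sketch below tabulates the minimum, maximum, and their difference per region, then reads off the region with the widest swing. It uses a toy frame with hypothetical values; on the real data the same calls could be applied to `df3`.

```python
import pandas as pd

# Toy data: region A swings far more than region B (made-up numbers)
toy = pd.DataFrame({
    "region": ["A", "A", "A", "B", "B", "B"],
    "est_unemp_perc": [5.0, 20.0, 8.0, 3.0, 6.0, 4.0],
})

# Minimum and maximum rate per region, plus the range between them
extremes = toy.groupby("region")["est_unemp_perc"].agg(["min", "max"])
extremes["range"] = extremes["max"] - extremes["min"]
extremes = extremes.sort_values("range", ascending=False)

# Region with the largest max - min difference
most_affected = extremes["range"].idxmax()
print(extremes)
print(most_affected)  # A
```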

## Conclusion
After analyzing the dataset, we gained some insights into the effects of Covid-19 on different states of India. Firstly, the peak of the crisis in April 2020 caused a sharp decrease in the labour participation rate, which reached its lowest level during that period. Secondly, the visualizations showed that urban areas had a higher unemployment rate than rural areas. Finally, although states like Meghalaya had the fewest employees, they also had the lowest unemployment rate, while other states, such as Puducherry, were badly affected by the crisis.