![](https://assets.vccircle.com/uploads/2018/08/GDPbyshahjunaid.jpg)

## GDP Analysis of the Indian States

#### This kernel is based on an assignment by IIITB in collaboration with upGrad.

#### If this Kernel helped you in any way, some <font color="red"><b>UPVOTES</b></font> would be very much appreciated.

#### Briefing
We are working as the chief data scientist at NITI Aayog, reporting to the CEO. The CEO has initiated a project wherein the NITI Aayog will provide top-level recommendations to the Chief Ministers (CMs) of various states, which will help them prioritise areas of development for their respective states. Since different states are in different phases of development, the recommendations should be specific to the states.

 
The overall goal of this project is to help the CMs focus on areas that will foster economic development for their respective states. Since the most common measure of economic development is the GDP, we will analyse the GDP of the various states of India and suggest ways to improve it.

#### Understanding GDP 
Gross domestic product (GDP) at current prices is the GDP at the market value of goods and services produced in a country during a year. In other words, GDP measures the 'monetary value of final goods and services produced by a country/state in a given period of time'.

 

GDP can be broadly divided into goods and services produced by three sectors: the primary sector (agriculture), the secondary sector (industry), and the tertiary sector (services).

 

GDP at current prices is also known as nominal GDP. Real GDP, by contrast, takes into account the price change that may have occurred due to inflation; in other words, real GDP is nominal GDP adjusted for inflation. We will use the nominal GDP for this exercise. Also, we will consider the financial year 2015-16 as the base year, as most of the data required for this exercise is available for that period.
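
The relationship between nominal and real GDP can be illustrated with a small sketch. The figures and the helper name `real_gdp` are hypothetical and are not taken from the dataset; the only thing assumed is the usual convention that the price index equals 100 in the base year.

```python
# Hypothetical figures for illustration only; they are not from the dataset.
nominal_gdp = 1200.0   # GDP at current prices
price_index = 120.0    # price index (deflator), with base year = 100


def real_gdp(nominal, index, base=100.0):
    """Deflate a nominal value to base-year prices."""
    return nominal * base / index


real = real_gdp(nominal_gdp, price_index)
print(real)  # 1000.0
```

When the price index equals the base value, nominal and real GDP coincide.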

#### Per Capita GDP and Income
Total GDP divided by the population gives the per capita GDP, which roughly measures the average value of goods and services produced per person. The per capita income is closely related to the per capita GDP (though they are not the same). In general, the per capita income increases when the per capita GDP increases, and vice-versa. For instance, in the financial year 2015-16, the per capita income of India was ₹93,293, whereas the per capita GDP of India was $1717, which roughly amounts to ₹1,11,605. 
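
As a quick check of the figures quoted above, the sketch below derives the exchange rate implied by the two per capita GDP values and compares per capita income with per capita GDP. The variable names are illustrative; the three input numbers are the ones stated in the paragraph.

```python
# Figures quoted in the text for the financial year 2015-16.
per_capita_gdp_usd = 1717
per_capita_gdp_inr = 111605      # ₹1,11,605
per_capita_income_inr = 93293    # ₹93,293

# Exchange rate implied by the dollar and rupee figures.
implied_rate = per_capita_gdp_inr / per_capita_gdp_usd
print(implied_rate)  # 65.0

# Per capita income as a fraction of per capita GDP.
income_share = per_capita_income_inr / per_capita_gdp_inr
print(round(income_share, 2))  # 0.84
```

So the conversion in the text corresponds to roughly ₹65 per dollar, and per capita income is about 84% of per capita GDP.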

## Reading and Understanding Data

In [None]:
# Import the required Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import glob
from functools import reduce
from itertools import cycle, islice
pd.options.display.float_format='{:.4f}'.format
plt.rcParams['figure.figsize'] = [11.5,8]
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_colwidth', None)  # None means no truncation; -1 is no longer accepted by recent pandas

In [None]:
# File path. 

path= '../input/eda-gdp-analysis-india/'

In [None]:
# Reading the relevant file on which Analysis needs to be done

file = path + 'SGDP.csv'
dfx = pd.read_csv(file)
dfx.head(4)

In [None]:
# shape of data

dfx.shape

In [None]:
# Data description

dfx.describe()

In [None]:
# Data Information

dfx.info()

## Data Cleansing and Preparation

In [None]:
# Calculating the Missing Values % contribution in DF

df_null=dfx.isna().mean().round(4) * 100
df_null

In [None]:
# Dropping columns where all rows are NaN

dfx1 = dfx.dropna(axis = 1, how = 'all')

In [None]:
# Dropping the data for Duration 2016-17 as it will not be used in Analysis

dfx2 = dfx1[dfx1.Duration != '2016-17']

In [None]:
# Transposing so that the states become rows, then dropping the Union Territories (UTs) as they are not needed for the analysis

dfx3 = dfx2.T
dfx4 = dfx3.drop(labels = ['Andaman & Nicobar Islands','Chandigarh','Delhi','Puducherry'])
#dfx3

In [None]:
# Mean of the row (% Growth over previous year) for duration 2013-14, 2014-15 and 2015-16

dfx4_mean = dfx4.iloc[2:,6:10].mean(axis = 1).round(2).sort_values()
dfx4_mean
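
The positional slice above depends on the exact row order of the file. A label-based selection is one possible alternative; the sketch below uses a toy frame, where the `Item` column name, the state names and all values are hypothetical (only the `Duration` column is known from the cells above).

```python
import pandas as pd

# Toy stand-in for SGDP.csv; the 'Item' column name and all values are hypothetical.
toy = pd.DataFrame({
    'Item': ['GSDP', 'GSDP', 'Growth', 'Growth', 'Growth'],
    'Duration': ['2014-15', '2015-16', '2013-14', '2014-15', '2015-16'],
    'State A': [100.0, 112.0, 10.0, 11.0, 12.0],
    'State B': [200.0, 216.0, 6.0, 7.0, 8.0],
})

# Select the growth rows for the three years by label rather than by position.
years = ['2013-14', '2014-15', '2015-16']
growth = toy[(toy['Item'] == 'Growth') & toy['Duration'].isin(years)]

# Average growth per state, sorted ascending as in the notebook.
avg_growth = growth.drop(columns=['Item', 'Duration']).mean().round(2).sort_values()
print(avg_growth)
```

Selecting by label keeps working if rows are added or reordered in the source file, at the cost of needing the exact label strings.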

## Data Visualization and Insights Extraction

In [None]:
# Bar Plot for Average growth rates of the various states for duration 2013-14, 2014-15 and 2015-16
plt.rcParams['figure.figsize'] = [11.5,8]
dfx4_mean.plot(kind='barh',stacked=True, colormap = 'Set1')
plt.title("Avg.% Growth of States for Duration 2013-14, 2014-15 and 2015-16", fontweight = 'bold')
plt.xlabel("Avg. % Growth", fontweight = 'bold')
plt.ylabel("States", fontweight = 'bold')

### Insights from the above Plot considering the average growth rates of various states for duration 2013-2016
* States like Mizoram, Nagaland, Tripura and Manipur, which are part of North-East India, have been consistently growing faster than their peer states, with an average growth rate of approximately 15%.
* Goa and Meghalaya are struggling to grow as compared to other states.

In [None]:
# Average growth rate of my home state against the National average Growth rate

dfx4_myhome = dfx4_mean[['Madhya Pradesh', 'All_India GDP']]

In [None]:
dfx4_myhome.plot(kind='bar',stacked=True, colormap = 'Dark2')
plt.title("Avg. % Growth of Home State vs National Avg. for Duration 2013-14, 2014-15 and 2015-16", fontweight = 'bold')
plt.ylabel("Average % Growth", fontweight = 'bold')
plt.xlabel("Home State Vs National Average", fontweight = 'bold')

### Insights from the above Plot considering the average growth rates of my Home state Vs National Average for duration 2013-2016
* The average growth rate of my home state, Madhya Pradesh (~14%), is greater than the national average growth rate (~12%). The state outperforms most other states, plausibly because it is rich in natural resources, fuels, minerals, agriculture and biodiversity.

 **Total GDP of the states for the year 2015-16**

In [None]:
#Selecting the GSDP for year 2015-16

dfx5_total_gdp = dfx4.iloc[2:,4:5]

In [None]:
# Dropping the GSDP of All_India as it will not be included in the plot

dfx6_total_gdp = dfx5_total_gdp.drop(labels = ['All_India GDP'])

In [None]:
#Plot for GSDP of all states including States with NaN

dfx6_total_gdp.sort_values(by=4).plot(kind='bar',stacked=True, colormap = 'Set1')
plt.title("Total GDP of States for duration 2015-16" , fontweight = 'bold')
plt.ylabel("Total GDP (in cr)",fontweight = 'bold')
plt.xlabel("States",fontweight = 'bold')

In [None]:
# Dropping the States whose GSDP is NaN for year 2015-16

dfx7_total_gdp = dfx6_total_gdp.dropna().sort_values(by = 4)

In [None]:
#Plot for GSDP of all states excluding States with NaN

dfx7_total_gdp.plot(kind='bar',stacked=True, colormap = 'autumn')
plt.title("Total GDP of States for duration 2015-16" , fontweight = 'bold')
plt.ylabel("Total GDP (in cr)",fontweight = 'bold')
plt.xlabel("States",fontweight = 'bold')

In [None]:
dfx7_total_gdp.shape

### Insights from the above Plot considering the GSDP of various states for duration 2015-16
*  GSDP of bigger states like Tamil Nadu (TN) and Uttar Pradesh (UP) is higher than that of smaller states like Sikkim and Arunachal Pradesh.
*  GSDP of southern states like TN, Karnataka and Kerala is higher than that of the rest of India.
*  The most populous state, Uttar Pradesh, stands at position 2 by GSDP.
*  Bangalore, India's Silicon Valley, helps Karnataka secure position 3.

In [None]:
# GSDP of Top 5 States
dfx7_total_gdp.tail(5).plot(kind='bar',stacked=True, colormap = 'Dark2')
plt.title("Total GDP of top 5 States for 2015-16", fontweight = 'bold')
plt.ylabel("Total GDP (in cr)",fontweight = 'bold')
plt.xlabel("States",fontweight = 'bold')


# GSDP of Bottom 5 States
dfx7_total_gdp.head(5).plot(kind='bar',stacked=True, colormap = 'Set1')
plt.title("Total GDP of bottom 5 States for 2015-16", fontweight = 'bold')
plt.ylabel("Total GDP (in cr)",fontweight = 'bold')
plt.xlabel("States",fontweight = 'bold')

### Insights from the above Plot considering the GSDP of top/bottom 5 states for duration 2015-16
*  The top 5 states contribute almost one-third (32%) of the total GSDP.
*  There is a significant difference in GSDP between the 5th state (Andhra Pradesh) and the rest of the top 5 states.
*  The bottom 5 states contribute only 1.5% of the total GSDP.
*  The GSDP of J&K is significantly higher than that of the other bottom states, driven by traditional recreational tourism; there is also vast scope for adventure, pilgrimage, spiritual and health tourism.
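
The percentage shares quoted in these insights are not computed in the cells above. A minimal sketch of how such shares could be derived is given below; the state names and values are hypothetical, and the denominator is assumed to be the sum over the states in the series (the notebook may instead have used the All-India figure).

```python
import pandas as pd


def share_of_total(values, total):
    """Percentage share of `values` in `total`, rounded to 2 decimals."""
    return round(values.sum() / total * 100, 2)


# Hypothetical GSDP values, sorted ascending like dfx7_total_gdp.
gsdp = pd.Series(
    {'A': 10.0, 'B': 20.0, 'C': 30.0, 'D': 40.0, 'E': 50.0, 'F': 50.0}
).sort_values()

total = gsdp.sum()
top2_share = share_of_total(gsdp.tail(2), total)     # two largest states
bottom2_share = share_of_total(gsdp.head(2), total)  # two smallest states
print(top2_share, bottom2_share)  # 50.0 15.0
```

With the real data, `tail(5)` and `head(5)` would replace the two-state slices used here.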

### Reading the States GDP

In [None]:
# Reading all the state-level csv files from the directory using glob for further analysis

pattern = path + 'N*.csv'

files = glob.glob(pattern)

frames = []

for f in files:
    dfs = pd.read_csv(f, encoding = 'unicode_escape')
    # Derive the state name from the file name by stripping the 'NAD-' prefix and the '-GSVA_cur_<year>.csv' suffix
    dfs['State'] = f.replace(path, '').replace('NAD-', '').replace('-GSVA_cur_2016-17.csv','').replace('-GSVA_cur_2015-16.csv','').replace('-GSVA_cur_2014-15.csv','').replace('_',' ')
    frames.append(dfs)

# DataFrame.append was removed in pandas 2.0, so the collected frames are concatenated instead
data = pd.concat(frames, sort=True)
data = data.iloc[:, ::-1]
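
The chain of `replace` calls above has to list every year suffix explicitly. One alternative is a regular expression; the sketch below assumes the file names follow the `NAD-<State>-GSVA_cur_<year>.csv` pattern implied by that chain. The helper name `state_from_filename` and the example paths are illustrative.

```python
import os
import re


def state_from_filename(file_path):
    """Extract a readable state name from a 'NAD-<State>-GSVA_cur_<year>.csv' path."""
    name = os.path.basename(file_path)
    match = re.match(r'NAD-(.+)-GSVA_cur_\d{4}-\d{2}\.csv$', name)
    if match is None:
        raise ValueError(f'Unexpected file name: {name}')
    # Underscores in the file name stand for spaces in the state name.
    return match.group(1).replace('_', ' ')


# Illustrative paths; the actual file names in the dataset may differ.
example = '../input/eda-gdp-analysis-india/NAD-Andhra_Pradesh-GSVA_cur_2016-17.csv'
print(state_from_filename(example))  # Andhra Pradesh
```

Raising on an unexpected name makes a malformed file visible instead of silently producing a wrong state label.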

In [None]:
# Selecting the required columns for the Analysis

df = data[['State', 'Item', '2014-15']] 
df1 = df.reset_index(drop = True)

In [None]:
# Cleansing the Item labels by stripping trailing asterisks

df1['Item'] = df1['Item'].str.rstrip('*')
df1 = df1.set_index('State')

In [None]:
# Pivoting the df for enhanced analysis of data

df2 = pd.pivot_table(df1, values = '2014-15', index=['Item'], columns = 'State').reset_index()
df3 = df2.set_index('Item',drop=True)
#df3

In [None]:
# Dropping the UT as it will not be used in further analysis

df4=df3.drop(['Andaman Nicobar Islands','Chandigarh','Delhi','Puducherry'],axis=1)

#### Plot the GDP per capita for all the states

In [None]:
df5_percapita = df4.loc['Per Capita GSDP (Rs.)'].sort_values()

In [None]:
#Plot for GDP per capita in Rs. for all states

df5_percapita.plot(kind='barh',stacked=True, colormap = 'gist_rainbow')
plt.title("GDP per Capita for All States for duration 2014-15", fontweight = 'bold')
plt.xlabel("GDP per Capita (in Rs.)",fontweight = 'bold')
plt.ylabel("States", fontsize = 12, fontweight = 'bold')

In [None]:
#Plot for GDP per Capita of top 5 States for 2014-15

df5_percapita.tail(5).plot(kind='bar',stacked=True, colormap = 'winter')
plt.title("GDP per Capita of top 5 States for 2014-15", fontweight = 'bold')
plt.ylabel("GDP per Capita (in Rs.)", fontweight = 'bold')
plt.xlabel("States", fontsize = 12, fontweight = 'bold')

In [None]:
#Plot for GDP per Capita of bottom 5 States for 2014-15

df5_percapita.head(5).plot(kind='bar',stacked=True, colormap = 'Set1')
plt.title("GDP per Capita of bottom 5 States for 2014-15", fontweight = 'bold')
plt.ylabel("GDP per Capita (in Rs.)", fontweight = 'bold')
plt.xlabel("States", fontweight = 'bold')

In [None]:
Goa_percapita = (df5_percapita['Goa']/df5_percapita.sum()*100).round(2)
Goa_percapita1 = (df5_percapita['Goa']/df5_percapita.mean()).round(2)
Goa_per_Bihar =  df5_percapita['Goa']/df5_percapita['Bihar']
Sikkim_percapita = (df5_percapita['Sikkim']/df5_percapita.sum()*100).round(2)
Bihar_percapita = (df5_percapita['Bihar']/df5_percapita.sum()*100).round(2)
UP_percapita = (df5_percapita['Uttar Pradesh']/df5_percapita.sum()*100).round(2)

### Insights from the above Plot considering the per capita GSDP of various states for duration 2014-15
* Goa, at the top of the chart, accounts for 8.5% of the sum of all states' per capita GSDP values (a relative measure only, since per capita figures are not additive).
* Goa's per capita GSDP is almost 2.5 times the simple average of the states' per capita GSDP.
* Goa's per capita GSDP is 8 times that of the poorest state, Bihar.
* Bihar, at the bottom of the chart, accounts for only 1% of the sum of all states' per capita GSDP values.
* Sikkim's economic development is based on advancement in tourism.
* UP, the most populous state of India, stands in the bottom 2 of the chart.
* Sikkim is doing exceptionally well as compared to its sister states.
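
Per capita figures are ratios, so the simple mean of state values used in the cell above is not the same as the national per capita GSDP, which weights each state by its population. The sketch below illustrates the difference with two hypothetical states; all names and numbers are made up for illustration.

```python
import pandas as pd

# Two hypothetical states: a small rich one and a large poorer one.
states = pd.DataFrame(
    {'gsdp': [100.0, 900.0], 'population': [1.0, 30.0]},
    index=['Small rich state', 'Large poor state'],
)
states['per_capita'] = states['gsdp'] / states['population']

# Simple (unweighted) mean of the per capita values.
simple_mean = states['per_capita'].mean()

# Population-weighted value: total output divided by total population.
national = states['gsdp'].sum() / states['population'].sum()

print(simple_mean)         # 65.0
print(round(national, 2))  # 32.26
```

The unweighted mean is pulled up by the small rich state, which is why ratios against it should be read as comparisons between states rather than against a national figure.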

In [None]:
# Ratio of the highest per capita GDP to the lowest per capita GDP

h_percapita = df5_percapita.iloc[-1]
l_percapita = df5_percapita.iloc[0]
percapita_ratio = (h_percapita/l_percapita).round(3)

percapita_ratio

*The ratio of the highest per capita GSDP to the lowest per capita GSDP is 8.005.*

#### Contribution of the Primary, Secondary and Tertiary sectors as a percentage of the total GDP for all the states

In [None]:
# Selecting Primary Secondary and Tertiary sector for percentage contribution in total GDP

df_gdp_con = df4.loc[['Primary', 'Secondary', 'Tertiary','Gross State Domestic Product']]
df_gdp_percon = (df_gdp_con.div(df_gdp_con.loc['Gross State Domestic Product'])*100).round(2)
df_gdp_percon =df_gdp_percon.T.iloc[:,:3]

In [None]:
# Plot for % contribution of sectors in total GDP

df_gdp_percon.plot(kind='bar',stacked=True, colormap = 'prism')
plt.title("% Contribution of Primary, Secondary, Tertiary sector in total GDP for 2014-15",fontweight = 'bold')
plt.ylabel("% Contribution", fontweight = 'bold')
plt.xlabel("States", fontweight = 'bold')

### Insights from the above Plot considering the % Contribution of Primary, Secondary, Tertiary sector in total GDP for 2014-15
* The Tertiary sector's contribution to each state's total GSDP is higher than that of the Primary sector.
* For Manipur, the Tertiary sector contributes almost 65%.
* For Sikkim, the Primary sector contributes the least and the Secondary sector contributes more than half of the state's total GSDP, helping it rank near the top in per capita GSDP.
* For Goa, the Primary sector contributes the least and most of the contribution comes from the Secondary and Tertiary sectors, putting it at the top of the per capita chart.
* For MP, the contribution from each sector is balanced, which supports a good GSDP.
* The contribution from the Secondary sector is fairly balanced across all states; it is neither too high nor too low.

#### Categorisation of states into four groups based on the GDP per capita (C1, C2, C3, C4, where C1 would have the highest per capita GDP and C4 the lowest)

In [None]:
# Sorting the df for better visualization

df_sort = df4.T.sort_values(by = 'Per Capita GSDP (Rs.)', ascending = False)

In [None]:
# Define the quantile values and bins for categorisation

df_sort.quantile([0.2,0.5,0.85,1], axis = 0)
bins = [0, 67385, 101332, 153064.85, 271793]
labels = ["C4", "C3", "C2", "C1"]
df_sort['Category'] = pd.cut(df_sort['Per Capita GSDP (Rs.)'], bins = bins, labels = labels)
df_index = df_sort.set_index('Category')
df_sum = df_index.groupby(['Category']).sum()
df_rename =  df_sum.rename(columns = {"Population ('00)" : "Population (00)"})
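
The bin edges in the cell above are hard-coded, presumably read off the quantile output. A sketch of deriving them directly from the quantiles follows; the ten-state series is hypothetical, while the quantile levels and labels match the cell.

```python
import pandas as pd

# Hypothetical per capita values for ten states.
per_capita = pd.Series(
    [50.0, 60.0, 70.0, 80.0, 90.0, 100.0, 110.0, 120.0, 130.0, 140.0],
    index=[f'State {i}' for i in range(1, 11)],
)

# Quantile levels from the notebook: 20th, 50th, 85th and 100th percentiles.
edges = per_capita.quantile([0.2, 0.5, 0.85, 1.0]).tolist()
bins = [0] + edges                   # lower bound 0, upper bound = maximum
labels = ['C4', 'C3', 'C2', 'C1']    # C4 = lowest, C1 = highest

# pd.cut uses right-closed intervals, so the maximum falls into C1.
category = pd.cut(per_capita, bins=bins, labels=labels)
counts = category.value_counts()
print(counts.sort_index())
```

Computing the edges from the data keeps the categories consistent if the input year or the set of states changes.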

In [None]:
# Selecting the sub sectors which will be used for further analysis

df7_sector = df_rename[['Agriculture, forestry and fishing','Mining and quarrying','Manufacturing','Electricity, gas, water supply & other utility services',
                 'Construction','Trade, repair, hotels and restaurants','Transport, storage, communication & services related to broadcasting','Financial services',
                'Real estate, ownership of dwelling & professional services','Public administration','Other services','Gross State Domestic Product']]

In [None]:
# Calculating and rounding the percentage contribution of each subsector in total GSDP

df8_per = (df7_sector.T.div(df7_sector.T.loc['Gross State Domestic Product'])*100)
df8_round = df8_per.round(2)
df9_per = df8_round.drop('Gross State Domestic Product')
df9_per

In [None]:
# Plot for % Contribution of subsectors in Total GDP for C1 states for 2014-15

df9_per['C1'].sort_values().plot(kind='bar',stacked=True, colormap = 'Accent')
plt.title("% Contribution of subsectors in Total GDP for C1 states for 2014-15", fontweight = 'bold')
plt.xlabel("Sub-sectors", fontweight = 'bold')
plt.ylabel("% Contribution", fontweight = 'bold')

In [None]:
# Plot for % Contribution of subsectors in Total GDP for C2 states for 2014-15

df9_per['C2'].sort_values().plot(kind='bar',stacked=True, colormap = 'Accent')
plt.title("% Contribution of subsectors in Total GDP for C2 states for 2014-15", fontweight = 'bold')
plt.ylabel("% Contribution", fontweight = 'bold')
plt.xlabel("Sub-sectors", fontweight = 'bold')

In [None]:
# Plot for % Contribution of subsectors in Total GDP for C3 states for 2014-15

df9_per['C3'].sort_values().plot(kind='bar',stacked=True, colormap = 'Accent')
plt.title("% Contribution of subsectors in Total GDP for C3 states for 2014-15", fontweight = 'bold')
plt.ylabel("% Contribution", fontweight = 'bold')
plt.xlabel("Sub-sectors", fontweight = 'bold')

In [None]:
# Plot for % Contribution of subsectors in Total GDP for C4 states for 2014-15

df9_per['C4'].sort_values().plot(kind='bar',stacked=True, colormap = 'Accent')
plt.title("% Contribution of subsectors in Total GDP for C4 states for 2014-15", fontweight = 'bold')
plt.ylabel("% Contribution", fontweight = 'bold')
plt.xlabel("Sub-sectors", fontweight = 'bold')

### Plot for top 3/4/5/6 sub-sectors that contribute to approximately 80% of the GSDP of each category.

In [None]:
# 80% Contribution by top subsectors in Total GSDP for C1/C2/C3/C4 States 2014-15

fig, axes = plt.subplots(2,2, figsize=(15,12))
fig.tight_layout()
plt.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=0.8, hspace=1.8)


df9 = df9_per.sort_values(by = ['C1', 'C2', 'C3', 'C4'], ascending = False)
topsubsector = df9[df9.C1.cumsum() <= 80]
top_c1 = topsubsector[['C1']]
top_c1.plot(kind='bar',stacked=True, colormap = 'Dark2',ax=axes[0][0])


df9 = df9_per.sort_values(by = ['C2', 'C3', 'C4','C1'], ascending = False)
topsubsector = df9[df9.C2.cumsum() <= 80]
top_c2 = topsubsector[['C2']]
top_c2.plot(kind='bar',stacked=True, colormap = 'Dark2',ax=axes[0][1])


df9 = df9_per.sort_values(by = ['C3', 'C4','C1','C2'], ascending = False)
topsubsector = df9[df9.C3.cumsum() <= 80]
top_c3 = topsubsector[['C3']]
top_c3.plot(kind='bar',stacked=True, colormap = 'prism',ax=axes[1][0])


df9 = df9_per.sort_values(by = ['C4','C1','C2', 'C3'], ascending = False)
topsubsector = df9[df9.C4.cumsum() <= 80]
top_c4 = topsubsector[['C4']]
top_c4.plot(kind='bar',stacked=True, colormap = 'prism',ax=axes[1][1])
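
The four blocks above repeat the same sort-and-filter step. It could be factored into a small helper, sketched below with hypothetical sub-sector shares; the name `top_contributors` is illustrative. As in the notebook, the filter `cumsum() <= 80` stops before the sub-sector that would push the running total past 80%.

```python
import pandas as pd


def top_contributors(shares, threshold=80.0):
    """Largest shares whose running total stays at or below `threshold`."""
    ordered = shares.sort_values(ascending=False)
    return ordered[ordered.cumsum() <= threshold]


# Hypothetical percentage shares for one category.
shares = pd.Series({
    'Sector A': 40.0, 'Sector B': 25.0, 'Sector C': 12.0,
    'Sector D': 10.0, 'Sector E': 8.0, 'Sector F': 5.0,
})

top = top_contributors(shares)
print(top)        # Sector A, Sector B, Sector C
print(top.sum())  # 77.0
```

With the notebook's data, something like `top_contributors(df9_per['C1'])` would reproduce the selection behind each panel.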


### Insights from the above Plot considering % contribution by top subsectors in Total GSDP for C1/C2/C3/C4 States for 2014-15

* In C1, the contribution of the sub-sectors is more balanced than in the other categories.
* Manufacturing is positively correlated with high GDP.
* Agriculture, forestry and fishing is negatively correlated with high GDP.
* Manufacturing, Real Estate and Trade are the major drivers in all the categories. Since these sub-sectors are positively correlated with high GDP, focusing on them should increase their contribution to the total GDP of every category (C1, C2, C3, C4).
* Recommendations (sectors that need improvement):<br>
   C1 - (Mining, Financial)<br>
   C2 - (Mining, Construction)<br> 
   C3 - (Manufacturing, Real Estate, Financial Services)<br> 
   C4 - (Manufacturing, Real Estate)

### GDP and Education Dropout Rates Relationship

In [None]:
# Reading the relevant file on which Analysis needs to be done

file1 = path + 'Dropout rate dataset.csv'
df_dropout = pd.read_csv(file1)

In [None]:
# Renaming the columns which are incorrect

df_rename = df_dropout.rename(columns = {'Primary - 2014-2015' : 'Primary - 2013-2014','Primary - 2014-2015.1' : 'Primary - 2014-2015'})

In [None]:
# Selecting the columns which will be used for further analysis

dfa = df_rename[['Level of Education - State','Primary - 2014-2015','Upper Primary - 2014-2015','Secondary - 2014-2015']] 

In [None]:
# Dropping the union territory because it will not be used in further analysis

dfa1 = dfa.drop([0,5,7,8,9,18,26,35,36])
dfa2 = dfa1.reset_index(drop=True)

In [None]:
# Calculating the Missing Values % contribution in DF

dfa2.isna().mean().round(2) * 100

In [None]:
# Selecting the required column for further analysis

dfa3 = df4.T.reset_index()
dfa4 = dfa3[['State', 'Per Capita GSDP (Rs.)']]

In [None]:
# Concatenating the Education dropout df and Per Capita of States df

dfa5 = pd.concat([dfa2, dfa4], axis = 1)
dfa6 = dfa5.drop(['State'], axis = 1) 
dfa7 = dfa6.set_index('Level of Education - State', drop = True)
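
The concatenation above pairs rows by position, which is only correct if both frames list the same states in the same order. A key-based merge is one way to make the pairing explicit and to surface mismatched names; the sketch below uses hypothetical states and values, with the column names taken from the cells above.

```python
import pandas as pd

# Hypothetical dropout data and per capita data with slightly different state sets.
dropout = pd.DataFrame({
    'Level of Education - State': ['State B', 'State A', 'State C'],
    'Primary - 2014-2015': [2.0, 1.0, 3.0],
})
per_capita = pd.DataFrame({
    'State': ['State A', 'State B', 'State D'],
    'Per Capita GSDP (Rs.)': [100.0, 200.0, 400.0],
})

# Outer merge on the state name; the indicator column shows where each row came from.
merged = dropout.merge(
    per_capita,
    left_on='Level of Education - State',
    right_on='State',
    how='outer',
    indicator=True,
)

# Rows present in only one frame point to spelling differences or missing states.
unmatched = merged[merged['_merge'] != 'both']
matched = merged[merged['_merge'] == 'both'].drop(columns=['State', '_merge'])
print(unmatched)
print(matched)
```

With the real data, any rows in `unmatched` would need their state names harmonised before analysing the relationship.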

### Correlation of GDP per capita with dropout rates in education (primary, upper primary and secondary) for the year 2014-2015 for each state

In [None]:
# Scatter Plot for GDP per capita with dropout rates in education

f, axes = plt.subplots(nrows = 2, ncols = 2, sharex=True, sharey = False)
plt.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=0.5, hspace=None)

sc = axes[0][0].scatter(dfa7['Primary - 2014-2015'],dfa7['Per Capita GSDP (Rs.)'], s=100, c='DarkRed',marker="o")
axes[0][0].set_ylabel('Per Capita GSDP (Rs.)')
axes[0][0].set_xlabel('Primary Education')

sc = axes[0][1].scatter(dfa7['Upper Primary - 2014-2015'],dfa7['Per Capita GSDP (Rs.)'], s=100, c='DarkBlue',marker="*")
axes[0][1].set_ylabel('Per Capita GSDP (Rs.)')
axes[0][1].set_xlabel('Upper Primary Education')

sc = axes[1][0].scatter(dfa7['Secondary - 2014-2015'],dfa7['Per Capita GSDP (Rs.)'], s=100, c='DarkGreen',marker="s")
axes[1][0].set_ylabel('Per Capita GSDP (Rs.)')
axes[1][0].set_xlabel('Secondary Education')

#### Correlation of GDP per capita with dropout rates in Primary Education

In [None]:
dfa7.plot(kind='scatter',x='Primary - 2014-2015',y='Per Capita GSDP (Rs.)', s=150, c='DarkRed',marker="o")

#### Correlation of GDP per capita with dropout rates in Upper Primary Education

In [None]:
dfa7.plot(kind='scatter',x='Upper Primary - 2014-2015',y='Per Capita GSDP (Rs.)', s=150, c='DarkRed',marker="*")

#### Correlation of GDP per capita with dropout rates in Secondary Education

In [None]:
dfa7.plot(kind='scatter',x='Secondary - 2014-2015',y='Per Capita GSDP (Rs.)', s=150, c='DarkRed',marker="s")

### Insights from Correlation of GDP per capita with dropout rates in education (primary, upper primary and secondary) for the year 2014-2015

* Dropout rates are higher in Secondary Education than in Primary and Upper Primary.
* Dropout rates for Primary and Upper Primary are less than or equal to 15%.
* Dropout rates for Secondary are generally above 5%.
* Statistics for Primary and Upper Primary are quite similar.
* The outliers are Goa and Sikkim, owing to their high per capita GDP.