### Question 6. (10 marks) Find the literacy group in India that has the highest percentage of people speaking three languages or more. Output this as rows having the following columns: state/ut, literacy-group, percentage. Call this literacy-india.csv and script/program to generate this literacy-india.sh.

# 1. Importing the necessary libraries

In [53]:
import numpy as np
import pandas as pd

# 2. Load the language data and clean it
For this question, we load the following data file:
- 'C-19 POPULATION BY BILINGUALISM, TRILINGUALISM, EDUCATIONAL LEVEL AND SEX'

In [54]:
xls_language_data_by_education = pd.read_excel('./dataset/DDW-C19-0000.xlsx', sheet_name='Sheet1', dtype='string')
xls_language_data_by_education.head()

Unnamed: 0,"C-19 POPULATION BY BILINGUALISM, TRILINGUALISM, EDUCATIONAL LEVEL AND SEX",Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10
0,State,District,Area Name,Total/,Educational level,Number speaking second language,,,Number speaking third language,,
1,code,code,,Rural/,,,,,,,
2,,,,Urban/,,Persons,Males,Females,Persons,Males,Females
3,1,2,3,4,5,6,7,8,9,10,11
4,,,,,,,,,,,


## (ii) Filter the datasheet and convert to a cleaner dataframe

The Excel sheet loaded in the previous cell keeps its multi-row header as ordinary data rows, which makes it awkward to work with in pandas. To simplify later operations, we keep only the data rows and assign new column names.

The new column names will be as follows:
- 'State_Code',
- 'District_Code',
- 'Area',
- 'TRU',
- 'Education_Level',
- 'Number Speaking Second Language - Persons',
- 'Number Speaking Second Language - Males',
- 'Number Speaking Second Language - Females',
- 'Number Speaking Third Language - Persons',
- 'Number Speaking Third Language - Males',
- 'Number Speaking Third Language - Females'

In [55]:
# Discard the first 5 rows (the multi-row header) and the last 3 rows
language_data_by_education = xls_language_data_by_education.iloc[5:-3, :].copy()

# Assign the new column names
language_data_by_education.columns = [
    'State_Code',
    'District_Code',
    'Area',
    'TRU',
    'Education_Level',
    'Number Speaking Second Language - Persons',
    'Number Speaking Second Language - Males',
    'Number Speaking Second Language - Females',
    'Number Speaking Third Language - Persons',
    'Number Speaking Third Language - Males',
    'Number Speaking Third Language - Females'
]

language_data_by_education.reset_index(drop=True, inplace=True)  # reset indexes
language_data_by_education.head()

Unnamed: 0,State_Code,District_Code,Area,TRU,Education_Level,Number Speaking Second Language - Persons,Number Speaking Second Language - Males,Number Speaking Second Language - Females,Number Speaking Third Language - Persons,Number Speaking Third Language - Males,Number Speaking Third Language - Females
0,0,0,INDIA,Total,Total,314988770,176696383,138292387,86009580,50536832,35472748
1,0,0,INDIA,Total,Illiterate,42266268,17851584,24414684,3879858,1890285,1989573
2,0,0,INDIA,Total,Literate,272722502,158844799,113877703,82129722,48646547,33483175
3,0,0,INDIA,Total,Literate but below primary,29345104,16126959,13218145,3733616,2108024,1625592
4,0,0,INDIA,Total,Primary but below middle,48570544,26588496,21982048,8636296,4782211,3854085
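
The slicing-and-renaming step above can be illustrated on a small frame. This is a minimal sketch with made-up rows, not the census sheet itself; the names `raw` and `clean` and all values are hypothetical.

```python
import pandas as pd

# Toy stand-in for a sheet whose header and footer rows were read as data
# (hypothetical values, not taken from the census file)
raw = pd.DataFrame(
    [
        ["State", "Area Name", "Persons"],   # header row 1
        ["code", None, None],                # header row 2
        ["00", "INDIA", "100"],              # first data row
        ["01", "JAMMU & KASHMIR", "40"],     # second data row
        ["Note:", None, None],               # trailing footnote row
    ]
)

# Keep only the data rows: skip the 2 header rows and the 1 footer row
clean = raw.iloc[2:-1, :].copy()

# Replace the positional column labels with descriptive names
clean.columns = ["State_Code", "Area", "Persons"]

# Restart the index at 0 so that row labels match row positions
clean = clean.reset_index(drop=True)
print(clean)
```

Taking a `.copy()` of the slice keeps later assignments from touching the original frame.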


## (iii) Convert columns containing numbers to numeric datatype
Every column was read as a string, so we convert the columns that hold counts from string to a numeric datatype.

In [56]:
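# Convert the columns containing counts to a numeric datatype.
# Assigning by column label (rather than through .iloc) lets pandas replace the column dtype.
numeric_columns = language_data_by_education.columns[5:]
language_data_by_education[numeric_columns] = language_data_by_education[numeric_columns].apply(pd.to_numeric)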
print('The datatypes of columns containing numeric values has been changed from string to numeric')
print(language_data_by_education.dtypes)

The datatypes of columns containing numeric values has been changed from string to numeric
State_Code                                   string
District_Code                                string
Area                                         string
TRU                                          string
Education_Level                              string
Number Speaking Second Language - Persons     int64
Number Speaking Second Language - Males       int64
Number Speaking Second Language - Females     int64
Number Speaking Third Language - Persons      int64
Number Speaking Third Language - Males        int64
Number Speaking Third Language - Females      int64
dtype: object
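
If the sheet might contain non-numeric entries, one option is to coerce them to missing values and inspect the affected rows instead of leaving them as strings. The following is a minimal sketch on toy data; the frame `df`, its column names, and the `"n/a"` entry are assumptions made for illustration, and the toy frame uses plain Python strings rather than the `string` dtype used above.

```python
import pandas as pd

# Toy frame with counts stored as text (hypothetical values)
df = pd.DataFrame(
    {
        "Area": ["INDIA", "GOA"],
        "Persons": ["100", "40"],
        "Males": ["60", "n/a"],  # one entry that is not a number
    }
)

count_columns = ["Persons", "Males"]

# errors="coerce" turns unparseable entries into NaN instead of raising
df[count_columns] = df[count_columns].apply(pd.to_numeric, errors="coerce")

# Rows in which at least one count could not be parsed
bad_rows = df[df[count_columns].isna().any(axis=1)]

print(df.dtypes)
print(bad_rows)
```

Checking `bad_rows` before continuing makes parsing problems visible early, at the cost of one extra step.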


# 3. Find the unique state codes in the language data (by education)

In [57]:
# Find the unique state codes in the language data (by education)
state_codes_from_language_data_by_education = language_data_by_education['State_Code'].dropna().unique()
state_codes_from_language_data_by_education = [state_code for state_code in state_codes_from_language_data_by_education]
print('Number of unique state codes in language data (by education) =', len(state_codes_from_language_data_by_education))

Number of unique state codes in language data (by education) = 36


# 5. Find the unique literacy groups for which the data is given

In [58]:
# Find the unique literacy groups in the language data (by education)
literacy_groups_from_language_data_by_education = language_data_by_education['Education_Level'].dropna().unique()
literacy_groups_from_language_data_by_education = [literacy_group for literacy_group in literacy_groups_from_language_data_by_education]
print('Number of unique literacy-groups in language data (by education) =', len(literacy_groups_from_language_data_by_education))
print('The unique values in literacy-group column are: ', literacy_groups_from_language_data_by_education)

Number of unique literacy-groups in language data (by education) = 8
The unique values in literacy-group column are:  <StringArray>
[                              'Total',                          'Illiterate',
                            'Literate',          'Literate but below primary',
            'Primary but below middle',   'Middle but below matric/secondary',
 'Matric/Secondary but below graduate',                  'Graduate and above']
Length: 8, dtype: string


In [59]:
# We will remove the literacy group 'Total' from this list
literacy_groups_from_language_data_by_education.remove('Total')
print('The unique values in education level column are: ', literacy_groups_from_language_data_by_education)

The unique values in education level column are:  ['Illiterate', 'Literate', 'Literate but below primary', 'Primary but below middle', 'Middle but below matric/secondary', 'Matric/Secondary but below graduate', 'Graduate and above']
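
The two cells above (collect the unique values, then drop `'Total'`) can also be combined into a single filtering step. This is a small sketch on a toy Series; `levels` and `groups` are hypothetical names.

```python
import pandas as pd

# Toy education-level column with repeats and a missing value (hypothetical)
levels = pd.Series(["Total", "Illiterate", "Literate", "Total", "Illiterate", None])

# unique() keeps first-appearance order; the condition drops the aggregate row label
groups = [level for level in levels.dropna().unique() if level != "Total"]
print(groups)
```

Unlike `list.remove`, the filter does not raise an error when `'Total'` is absent.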


# 6. Load Population data by Education and clean
We will load the following data file:
- 'C-08 Data by Literacy and Age group'

In [60]:
xls_population_data_by_education = pd.read_excel('./dataset/DDW-0000C-08.xlsx', dtype='string')
xls_population_data_by_education.head()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 35,Unnamed: 36,Unnamed: 37,Unnamed: 38,Unnamed: 39,Unnamed: 40,Unnamed: 41,Unnamed: 42,Unnamed: 43,Unnamed: 44
0,Table,State,Distt.,Area Name,Total/,Age-group,Total,,,Illiterate,...,,,,,,,,,,
1,Name,Code,Code,,Rural/,,,,,,...,,Technical diploma or certificate,,,Graduate & above,,,Unclassified,,
2,,,,,Urban/,,,,,,...,,not equal to degree,,,,,,,,
3,,,,,,,Persons,Males,Females,Persons,...,Females,Persons,Males,Females,Persons,Males,Females,Persons,Males,Females
4,,,,,,1,2,3,4,5,...,31,32,33,34,35,36,37,38,39,40


## (ii) Filter the datasheet and convert to a cleaner dataframe

The Excel sheet loaded in the previous cell keeps its multi-row header as ordinary data rows, which makes it awkward to work with in pandas. To simplify later operations, we keep only the data rows and assign new column names.

In [61]:
# Discard the first 6 rows
population_data_by_education = xls_population_data_by_education.iloc[6:, :].copy()


# Assign the new column names
population_data_by_education.columns = [
    'Table_Name',
    'State_Code',
    'District_Code',
    'Area_Name',
    'TRU',
    'Age_Group',
    'Total - Persons',
    'Total - Males',
    'Total - Females',
    'Illiterate - Persons',
    'Illiterate - Males',
    'Illiterate - Females',
    'Literate - Persons',
    'Literate - Males',
    'Literate - Females',
    'Literate without education level - Persons',
    'Literate without education level - Males',
    'Literate without education level - Females',
    'Below primary - Persons',
    'Below primary - Males',
    'Below primary - Females',
    'Primary - Persons',
    'Primary - Males',
    'Primary - Females',
    'Middle - Persons',
    'Middle - Males',
    'Middle - Females',
    'Matric/Secondary - Persons',
    'Matric/Secondary - Males',
    'Matric/Secondary - Females',
    'Higher secondary/Intermediate/Pre-University/Senior secondary - Persons',
    'Higher secondary/Intermediate/Pre-University/Senior secondary - Males',
    'Higher secondary/Intermediate/Pre-University/Senior secondary - Females',
    'Non-technical diploma or certificate not equal to degree - Persons',
    'Non-technical diploma or certificate not equal to degree - Males',
    'Non-technical diploma or certificate not equal to degree - Females',
    'Technical diploma or certificate not equal to degree - Persons',
    'Technical diploma or certificate not equal to degree - Males',
    'Technical diploma or certificate not equal to degree - Females',
    'Graduate & above - Persons',
    'Graduate & above - Males',
    'Graduate & above - Females',
    'Unclassified - Persons',
    'Unclassified - Males',
    'Unclassified - Females'
]

population_data_by_education.reset_index(drop=True, inplace=True)  # reset indexes
population_data_by_education.head()

Unnamed: 0,Table_Name,State_Code,District_Code,Area_Name,TRU,Age_Group,Total - Persons,Total - Males,Total - Females,Illiterate - Persons,...,Non-technical diploma or certificate not equal to degree - Females,Technical diploma or certificate not equal to degree - Persons,Technical diploma or certificate not equal to degree - Males,Technical diploma or certificate not equal to degree - Females,Graduate & above - Persons,Graduate & above - Males,Graduate & above - Females,Unclassified - Persons,Unclassified - Males,Unclassified - Females
0,C2308,0,0,INDIA,Total,All ages,1210854977,623270258,587584719,447216165,...,345724,7238719,5354161,1884558,68288971,42120460,26168511,3031570,1647116,1384454
1,C2308,0,0,INDIA,Total,0-6,164515253,85752254,78762999,164515253,...,0,0,0,0,0,0,0,0,0,0
2,C2308,0,0,INDIA,Total,7,24826640,12903364,11923276,6748214,...,0,0,0,0,0,0,0,136465,75715,60750
3,C2308,0,0,INDIA,Total,8,26968373,14061937,12906436,4131414,...,0,0,0,0,0,0,0,96524,52561,43963
4,C2308,0,0,INDIA,Total,9,23424638,12214985,11209653,2491904,...,0,0,0,0,0,0,0,70452,38456,31996


## (iii) Convert columns containing numbers to numeric datatype
Every column was read as a string, so we convert the columns that hold counts from string to a numeric datatype.

In [62]:
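# Convert the columns containing counts to a numeric datatype.
# Assigning by column label (rather than through .iloc) lets pandas replace the column dtype.
numeric_columns = population_data_by_education.columns[6:]
population_data_by_education[numeric_columns] = population_data_by_education[numeric_columns].apply(pd.to_numeric)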
print('The datatypes of columns containing numeric values has been changed from string to numeric')
print(population_data_by_education.dtypes)

The datatypes of columns containing numeric values has been changed from string to numeric
Table_Name                                                                 string
State_Code                                                                 string
District_Code                                                              string
Area_Name                                                                  string
TRU                                                                        string
Age_Group                                                                  string
Total - Persons                                                             int64
Total - Males                                                               int64
Total - Females                                                             int64
Illiterate - Persons                                                        int64
Illiterate - Males                                                          int64
Illiter

# 7. Merge the data into the following literacy groups

- 'Illiterate',
- 'Literate',
- 'Literate but below primary',
- 'Primary but below middle',
- 'Middle but below matric/secondary',
- 'Matric/Secondary but below graduate',
- 'Graduate and above'

In [63]:
# create a data frame to store the population data by literacy group
population_literacy_group = pd.DataFrame(columns=['State_Code', 'Literacy_Group', 'Total - Persons', 'Total - Males', 'Total - Females'])

for state_code in state_codes_from_language_data_by_education:

    # Extract the data for this state
    state_data = population_data_by_education[(population_data_by_education['State_Code'] == state_code)
                                              &
                                              (population_data_by_education['TRU'] == 'Total')
                                              &
                                              (population_data_by_education['Age_Group'] == 'All ages')]
    
    ####################################### Literacy Group: Illiterate #############################################

    # extract the population data for this literacy group
    total_persons = state_data['Illiterate - Persons'].values[0]
    total_males = state_data['Illiterate - Males'].values[0]
    total_females = state_data['Illiterate - Females'].values[0]

    # append data for this literacy group to the dataframe
    population_literacy_group.loc[-1] = [state_code, 'Illiterate', total_persons, total_males, total_females]
    population_literacy_group.index += 1

    ####################################### Literacy Group: Literate #############################################

    # extract the population data for this literacy group
    total_persons = state_data['Literate - Persons'].values[0]
    total_males = state_data['Literate - Males'].values[0]
    total_females = state_data['Literate - Females'].values[0]

    # append data for this literacy group to the dataframe
    population_literacy_group.loc[-1] = [state_code, 'Literate', total_persons, total_males, total_females]
    population_literacy_group.index += 1

    ####################################### Literacy Group: 'Literate but below primary' #############################################

    # extract the population data for this literacy group
    total_persons = state_data['Below primary - Persons'].values[0]
    total_males = state_data['Below primary - Males'].values[0]
    total_females = state_data['Below primary - Females'].values[0]

    # append data for this literacy group to the dataframe
    population_literacy_group.loc[-1] = [state_code, 'Literate but below primary', total_persons, total_males, total_females]
    population_literacy_group.index += 1

    ####################################### Literacy Group: 'Primary but below middle' #############################################

    # extract the population data for this literacy group
    total_persons = state_data['Primary - Persons'].values[0]
    total_males = state_data['Primary - Males'].values[0]
    total_females = state_data['Primary - Females'].values[0]

    # append data for this literacy group to the dataframe
    population_literacy_group.loc[-1] = [state_code, 'Primary but below middle', total_persons, total_males, total_females]
    population_literacy_group.index += 1

    ####################################### Literacy Group: 'Middle but below matric/secondary' #############################################

    # extract the population data for this literacy group
    total_persons = state_data['Middle - Persons'].values[0]
    total_males = state_data['Middle - Males'].values[0]
    total_females = state_data['Middle - Females'].values[0]

    # append data for this literacy group to the dataframe
    population_literacy_group.loc[-1] = [state_code, 'Middle but below matric/secondary', total_persons, total_males, total_females]
    population_literacy_group.index += 1

    ####################################### Literacy Group: 'Matric/Secondary but below graduate' #############################################

    # extract the population data for this literacy group
    total_persons = state_data['Matric/Secondary - Persons'].values[0]
    total_males = state_data['Matric/Secondary - Males'].values[0]
    total_females = state_data['Matric/Secondary - Females'].values[0]

    total_persons += state_data['Higher secondary/Intermediate/Pre-University/Senior secondary - Persons'].values[0]
    total_males += state_data['Higher secondary/Intermediate/Pre-University/Senior secondary - Males'].values[0]
    total_females += state_data['Higher secondary/Intermediate/Pre-University/Senior secondary - Females'].values[0]

    total_persons += state_data['Non-technical diploma or certificate not equal to degree - Persons'].values[0]
    total_males += state_data['Non-technical diploma or certificate not equal to degree - Males'].values[0]
    total_females += state_data['Non-technical diploma or certificate not equal to degree - Females'].values[0]

    total_persons += state_data['Technical diploma or certificate not equal to degree - Persons'].values[0]
    total_males += state_data['Technical diploma or certificate not equal to degree - Males'].values[0]
    total_females += state_data['Technical diploma or certificate not equal to degree - Females'].values[0]

    # append data for this literacy group to the dataframe
    population_literacy_group.loc[-1] = [state_code, 'Matric/Secondary but below graduate', total_persons, total_males, total_females]
    population_literacy_group.index += 1

    ####################################### Literacy Group: 'Graduate and above' #############################################

    # extract the population data for this literacy group
    total_persons = state_data['Graduate & above - Persons'].values[0]
    total_males = state_data['Graduate & above - Males'].values[0]
    total_females = state_data['Graduate & above - Females'].values[0]

    # append data for this literacy group to the dataframe
    population_literacy_group.loc[-1] = [state_code, 'Graduate and above', total_persons, total_males, total_females]
    population_literacy_group.index += 1
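
The same regrouping can be expressed without a per-state loop by mapping each literacy group to the source columns that feed it. The sketch below runs on toy numbers; the frame `wide`, the shortened column name `'Higher secondary - Persons'`, and the dictionary `group_sources` are assumptions for illustration and cover only three of the seven groups.

```python
import pandas as pd

# Toy stand-in for the 'Total' / 'All ages' rows, one row per state (hypothetical numbers)
wide = pd.DataFrame(
    {
        "State_Code": ["00", "01"],
        "Illiterate - Persons": [50, 20],
        "Matric/Secondary - Persons": [30, 10],
        "Higher secondary - Persons": [15, 5],   # shortened, hypothetical column name
        "Graduate & above - Persons": [5, 2],
    }
)

# Each literacy group is the sum of one or more source columns
group_sources = {
    "Illiterate": ["Illiterate - Persons"],
    "Matric/Secondary but below graduate": [
        "Matric/Secondary - Persons",
        "Higher secondary - Persons",
    ],
    "Graduate and above": ["Graduate & above - Persons"],
}

frames = []
for group, columns in group_sources.items():
    frames.append(
        pd.DataFrame(
            {
                "State_Code": wide["State_Code"],
                "Literacy_Group": group,
                # Row-wise sum over the source columns of this group
                "Total - Persons": wide[columns].sum(axis=1),
            }
        )
    )

# One row per (state, literacy group)
long = pd.concat(frames, ignore_index=True)
print(long)
```

Keeping the group-to-column mapping in one dictionary makes it easier to check which census categories were folded into each group.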

# 8. Save the cleaned dataset for future use

In [64]:
population_literacy_group.to_csv('./dataset/C08-clean.csv', index=False)
population_literacy_group.head()

Unnamed: 0,State_Code,Literacy_Group,Total - Persons,Total - Males,Total - Females
251,0,Illiterate,447216165,188506636,258709529
250,0,Literate,763638812,434763622,328875190
249,0,Literate but below primary,146897597,78445099,68452498
248,0,Primary but below middle,184170833,99311072,84859761
247,0,Middle but below matric/secondary,133903266,77629578,56273688


# 9. Prepare the output file literacy-india.csv

We will calculate the following value for India and for each state and union territory (UT):
- Literacy group having the highest percentage of people speaking three languages or more

In [65]:
# Prepare the output file to store the data for India, its states, and union territories
literacy_india = pd.DataFrame(columns=['state/ut', 'literacy-group', 'percentage'])

for state_code in state_codes_from_language_data_by_education:

    # extract the state's total population data by literacy
    total_state_population_by_literacy_group = population_literacy_group[population_literacy_group['State_Code'] == state_code]
    
    # extract the total language data for the state
    language_data_for_state_total = language_data_by_education[(language_data_by_education['State_Code'] == state_code)
                                                                &
                                                               (language_data_by_education['TRU'] == 'Total')]

    # calculate the percentage of people speaking three languages or more for each literacy group in this state
    # and find the literacy-group having the max percentage
    
    max_perc_literacy_group = -1
    for literacy_group in literacy_groups_from_language_data_by_education:

        # find the population speaking three languages or more in this literacy group
        three_lang_or_more_literacy_group = language_data_for_state_total[language_data_for_state_total['Education_Level'] == literacy_group]['Number Speaking Third Language - Persons'].values[0]
        
        # find the total state population of this literacy group
        total_population_for_literacy_group = total_state_population_by_literacy_group[total_state_population_by_literacy_group['Literacy_Group'] == literacy_group]['Total - Persons'].values[0]
        
        # calculate percentage
        perc = (three_lang_or_more_literacy_group / total_population_for_literacy_group) * 100

        # update max variables
        if perc > max_perc_literacy_group:
            max_literacy_group = literacy_group
            max_perc_literacy_group = perc

    # append data to dataframe
    literacy_india.loc[-1] = [state_code, max_literacy_group, max_perc_literacy_group]
    literacy_india.index += 1

# dump data to csv files
literacy_india = literacy_india.sort_values('state/ut')  # sort values by first column
literacy_india.to_csv('./output/literacy-india.csv', index=False)
literacy_india.head()

Unnamed: 0,state/ut,literacy-group,percentage
35,0,Graduate and above,32.430629
34,1,Graduate and above,67.385479
33,2,Graduate and above,15.250252
32,3,Graduate and above,76.651962
31,4,Graduate and above,69.466658
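
The percentage used above is (people speaking a third language in the group) / (total population of the group) × 100, and the row with the largest value is kept per state. As a hedged alternative to the nested loops, the same selection can be sketched with a merge followed by `groupby(...).idxmax()`. All frames, column names, and numbers below are hypothetical.

```python
import pandas as pd

# Toy counts of third-language speakers per state and literacy group (hypothetical)
speakers = pd.DataFrame(
    {
        "State_Code": ["00", "00", "01", "01"],
        "Literacy_Group": ["Illiterate", "Graduate and above"] * 2,
        "Third_Language_Persons": [5, 20, 8, 3],
    }
)

# Toy total population per state and literacy group (hypothetical)
population = pd.DataFrame(
    {
        "State_Code": ["00", "00", "01", "01"],
        "Literacy_Group": ["Illiterate", "Graduate and above"] * 2,
        "Total_Persons": [100, 50, 40, 30],
    }
)

# Align the two tables on state and literacy group
merged = speakers.merge(population, on=["State_Code", "Literacy_Group"], how="inner")
merged["percentage"] = merged["Third_Language_Persons"] / merged["Total_Persons"] * 100

# For each state, pick the row whose percentage is largest
best_rows = merged.groupby("State_Code")["percentage"].idxmax()
best = merged.loc[best_rows, ["State_Code", "Literacy_Group", "percentage"]].reset_index(drop=True)
print(best)
```

Merging on both keys guards against pairing a speaker count with the population of a different group, which positional lookups with `.values[0]` cannot detect.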


--------------------------------------------------------------------------------- END of Q6 ---------------------------------------------------------------------------------------------