### Question 5. (10 marks) Find the age group in India that has the highest percentage of people speaking three languages or more. Do this for all the states and union territories as well. Output this as rows having the following columns: state/ut, age-group, percentage. Call this age-india.csv and script/program to generate this age-india.sh.

# 1. Importing the necessary libraries

In [1]:
import numpy as np
import pandas as pd

# 2. Load language data by age and clean
For this question we will first load the following data file:
- 'C-18 POPULATION BY BILINGUALISM, TRILINGUALISM, AGE AND SEX'

In [2]:
xls_language_data_by_age = pd.read_excel('./dataset/DDW-C18-0000.xlsx', sheet_name='Sheet1', dtype='string')
xls_language_data_by_age.head()

Unnamed: 0,"C-18 POPULATION BY BILINGUALISM, TRILINGUALISM, AGE AND SEX",Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10
0,State,District,Area Name,Total/,Age-group,Number speaking second language,,,Number speaking third language,,
1,code,code,,Rural/,,,,,,,
2,,,,Urban,,Persons,Males,Females,Persons,Males,Females
3,1,2,3,4,5,6,7,8,9,10,11
4,,,,,,,,,,,


## (ii) Filter the datasheet and convert to a cleaner dataframe

The Excel sheet loaded in the previous cell has its header spread over several rows, which makes it awkward to work with in pandas. To simplify later operations, we drop those header rows and assign new column names.

The new column names will be as follows:
- 'State_Code',
- 'District_Code',
- 'Area',
- 'TRU',
- 'Age_Group',
- 'Number Speaking Second Language - Persons',
- 'Number Speaking Second Language - Males',
- 'Number Speaking Second Language - Females',
- 'Number Speaking Third Language - Persons',
- 'Number Speaking Third Language - Males',
- 'Number Speaking Third Language - Females'

In [3]:
# Discard the first 5 rows (indices 0-4), which hold the multi-row header
language_data_by_age = xls_language_data_by_age.iloc[5:, :].copy()

# Assign the new column names
language_data_by_age.columns = [
    'State_Code',
    'District_Code',
    'Area',
    'TRU',
    'Age_Group',
    'Number Speaking Second Language - Persons',
    'Number Speaking Second Language - Males',
    'Number Speaking Second Language - Females',
    'Number Speaking Third Language - Persons',
    'Number Speaking Third Language - Males',
    'Number Speaking Third Language - Females'
]

language_data_by_age.reset_index(drop=True, inplace=True)  # reset indexes
language_data_by_age.head()

Unnamed: 0,State_Code,District_Code,Area,TRU,Age_Group,Number Speaking Second Language - Persons,Number Speaking Second Language - Males,Number Speaking Second Language - Females,Number Speaking Third Language - Persons,Number Speaking Third Language - Males,Number Speaking Third Language - Females
0,0,0,INDIA,Total,Total,314988770,176696383,138292387,86009580,50536832,35472748
1,0,0,INDIA,Total,5-9,15649192,8166843,7482349,1844108,978151,865957
2,0,0,INDIA,Total,10-14,34488492,18133423,16355069,7254335,3831131,3423204
3,0,0,INDIA,Total,15-19,42424599,22750908,19673691,12626717,6792766,5833951
4,0,0,INDIA,Total,20-24,41344406,22386694,18957712,12834334,7067614,5766720
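
The same header-stripping pattern is applied again later to the C-13 sheet, so it could be wrapped in a small helper. The following is a minimal sketch on a toy frame; the helper name `drop_header_rows` and the toy values are hypothetical and not part of the original notebook.

```python
import pandas as pd


def drop_header_rows(raw, n_header_rows, column_names):
    """Drop the leading header rows of a raw sheet and assign clean column names.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    cleaned = raw.iloc[n_header_rows:].copy()
    cleaned.columns = column_names
    return cleaned.reset_index(drop=True)


# Toy stand-in for a sheet whose header occupies the first two rows
raw = pd.DataFrame([
    ['State', 'Area Name', 'Persons'],
    ['code', None, None],
    ['00', 'AREA A', '120'],
    ['01', 'AREA B', '45'],
])

cleaned = drop_header_rows(raw, 2, ['State_Code', 'Area', 'Persons'])
print(cleaned)
```

Passing the number of header rows explicitly keeps the slice and its explanation in one place, which helps avoid a comment and a slice drifting apart.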


## (iii) Convert columns containing numbers to numeric datatype
At this point every column in the dataframe has the string datatype, so we convert the columns that hold counts from string to numeric.

In [4]:
# Convert the columns containing numbers to numeric datatype
language_data_by_age.iloc[:, 5:] = language_data_by_age.iloc[:, 5:].apply(pd.to_numeric, errors='ignore')
print('The datatypes of columns containing numeric values have been changed from string to numeric')
print(language_data_by_age.dtypes)

The datatypes of columns containing numeric values have been changed from string to numeric
State_Code                                   string
District_Code                                string
Area                                         string
TRU                                          string
Age_Group                                    string
Number Speaking Second Language - Persons     int64
Number Speaking Second Language - Males       int64
Number Speaking Second Language - Females     int64
Number Speaking Third Language - Persons      int64
Number Speaking Third Language - Males        int64
Number Speaking Third Language - Females      int64
dtype: object
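
A label-based variant of this conversion can be sketched on toy data. It assumes every value in the selected columns parses as a number, so `pd.to_numeric` is left at its default of raising on bad input; the column names and values below are illustrative only.

```python
import pandas as pd

# Toy frame in which every column was read as text
toy = pd.DataFrame({
    'Age_Group': ['5-9', '10-14'],
    'Persons': ['100', '250'],
    'Males': ['60', '130'],
})

numeric_columns = ['Persons', 'Males']

# Replace whole columns by label so each one takes the dtype returned by to_numeric
toy[numeric_columns] = toy[numeric_columns].apply(pd.to_numeric)

print(toy.dtypes)
print(toy['Persons'].sum())
```

Assigning by column label replaces the columns outright, so the result does not depend on how positional assignment treats the existing string dtype.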


# 3. Find the unique state codes in the language data (by age)

In [5]:
# Find the unique state codes in the language data (by age)
state_codes_from_language_data_by_age = list(language_data_by_age['State_Code'].dropna().unique())
print('Number of unique state codes in language data (by age) =', len(state_codes_from_language_data_by_age))

Number of unique state codes in language data (by age) = 36


# 4. Find the unique age groups for which the data is given

In [6]:
# Find the unique age groups in the language data (by age)
age_groups_from_language_data_by_age = list(language_data_by_age['Age_Group'].dropna().unique())
print('Number of unique age-groups in language data (by age) =', len(age_groups_from_language_data_by_age))
print('The unique values in age-group column are: ', age_groups_from_language_data_by_age)

Number of unique age-groups in language data (by age) = 10
The unique values in age-group column are:  ['Total', '5-9', '10-14', '15-19', '20-24', '25-29', '30-49', '50-69', '70+', 'Age not stated']


In [7]:
# We will remove the age_group 'Total' from this list
age_groups_from_language_data_by_age.remove('Total')
print('The unique values in age-group column are: ', age_groups_from_language_data_by_age)

The unique values in age-group column are:  ['5-9', '10-14', '15-19', '20-24', '25-29', '30-49', '50-69', '70+', 'Age not stated']


# 5. Load Population data by Age and clean
We will load the following data file:
- 'C-13 SINGLE YEAR AGE RETURNS BY RESIDENCE AND SEX '

In [8]:
xls_population_data_by_age = pd.read_excel('./dataset/DDW-0000C-13.xls', dtype='string')
xls_population_data_by_age.head()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,C-13 SINGLE YEAR AGE RETURNS BY RESIDENCE AND SEX,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13
0,Table,State,Distt.,Area Name,Age,Total,,,Rural,,,Urban,,
1,Name,Code,Code,,,,,,,,,,,
2,,,,,,Persons,Males,Females,Persons,Males,Females,Persons,Males,Females
3,,,,,1,2,3,4,5,6,7,8,9,10
4,,,,,,,,,,,,,,


## (ii) Filter the datasheet and convert to a cleaner dataframe

The Excel sheet loaded in the previous cell also has a multi-row header and more columns than we need. To simplify later operations, we drop the header rows, keep only the relevant columns, and assign new column names.

The new column names will be as follows:
- 'State_Code',
- 'Age',
- 'Total - Persons',
- 'Total - Males',
- 'Total - Females'

In [9]:
# Discard the first 6 rows (header and blank rows) and keep only the useful columns
population_data_by_age = xls_population_data_by_age.iloc[6:, [1, 4, 5, 6, 7]].copy()

# Assign the new column names
population_data_by_age.columns = [
    'State_Code',
    'Age',
    'Total - Persons',
    'Total - Males',
    'Total - Females'
]

population_data_by_age.reset_index(drop=True, inplace=True)  # reset indexes
population_data_by_age.head()

Unnamed: 0,State_Code,Age,Total - Persons,Total - Males,Total - Females
0,0,All ages,1210854977,623270258,587584719
1,0,0,20311234,10633298,9677936
2,0,1,21755197,11381468,10373729
3,0,2,23056268,11952853,11103415
4,0,3,23974041,12331431,11642610


## (iii) Convert columns containing numbers to numeric datatype
At this point every column in the dataframe has the string datatype, so we convert the columns that hold counts from string to numeric.

In [10]:
# Convert the columns containing numbers to numeric datatype
population_data_by_age.iloc[:, 2:] = population_data_by_age.iloc[:, 2:].apply(pd.to_numeric, errors='ignore')
print('The datatypes of columns containing numeric values have been changed from string to numeric')
print(population_data_by_age.dtypes)

The datatypes of columns containing numeric values have been changed from string to numeric
State_Code         string
Age                string
Total - Persons     int64
Total - Males       int64
Total - Females     int64
dtype: object


# 6. Merge the rows into the following age groups

- '5-9',
- '10-14',
- '15-19',
- '20-24',
- '25-29',
- '30-49',
- '50-69',
- '70+',
- 'Age not stated'

In [11]:
def state_population_data_for_age_range(state_data, age_list):
    '''
    Take the dataframe of single-year age data for one state and a list of age values, and return
    the total number of persons, males, and females summed over all ages in the list.
    Example:
    If age_list = [5, 6, 7, 8, 9], the function returns total_persons, total_males, and total_females
    for the age group 5-9.
    '''

    # initialize with zero
    total_persons = 0
    total_males = 0
    total_females = 0

    # add the population values for each age value in list
    for age in age_list:
        total_persons += state_data[state_data['Age'] == str(age)]['Total - Persons'].values[0]
        total_males += state_data[state_data['Age'] == str(age)]['Total - Males'].values[0]
        total_females += state_data[state_data['Age'] == str(age)]['Total - Females'].values[0]

    return total_persons, total_males, total_females
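
As an alternative sketch, the same totals can be computed with one filter and one column-wise sum instead of three lookups per age. The function name `population_for_ages` and the toy data are hypothetical. Unlike the function above, this variant silently skips ages that are missing from the data instead of raising an `IndexError`.

```python
import pandas as pd


def population_for_ages(state_data, ages):
    """Return (persons, males, females) summed over the given single-year ages."""
    wanted = [str(age) for age in ages]  # the 'Age' column holds strings
    rows = state_data[state_data['Age'].isin(wanted)]
    totals = rows[['Total - Persons', 'Total - Males', 'Total - Females']].sum()
    return tuple(int(value) for value in totals)


# Toy data for a single state (illustrative numbers only)
state_data = pd.DataFrame({
    'Age': ['4', '5', '6', '7', '8', '9', '10'],
    'Total - Persons': [10, 20, 30, 40, 50, 60, 70],
    'Total - Males': [5, 10, 15, 20, 25, 30, 35],
    'Total - Females': [5, 10, 15, 20, 25, 30, 35],
})

result = population_for_ages(state_data, range(5, 10))
print(result)  # (200, 100, 100)
```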

In [12]:
# create a data frame to store the population data by age group
population_age_group = pd.DataFrame(columns=['State_Code', 'Age_Group', 'Total - Persons', 'Total - Males', 'Total - Females'])

for state_code in state_codes_from_language_data_by_age:

    # Extract the data for this state
    state_data = population_data_by_age[population_data_by_age['State_Code'] == state_code]

    ####################################### Age Group: 5-9 #############################################

    # calculate data for this age group using our utility function
    total_persons, total_males, total_females = state_population_data_for_age_range(state_data, [x for x in range(5, 10)])

    # append data for this age group to dataframe
    population_age_group.loc[-1] = [state_code, '5-9', total_persons, total_males, total_females]
    population_age_group.index += 1

    ####################################### Age Group: 10-14 #############################################
    
    # calculate data for this age group using our utility function
    total_persons, total_males, total_females = state_population_data_for_age_range(state_data, [x for x in range(10, 15)])

    # append data for this age group to dataframe
    population_age_group.loc[-1] = [state_code, '10-14', total_persons, total_males, total_females]
    population_age_group.index += 1

    ####################################### Age Group: 15-19 #############################################
    
    # calculate data for this age group using our utility function
    total_persons, total_males, total_females = state_population_data_for_age_range(state_data, [x for x in range(15, 20)])

    # append data for this age group to dataframe
    population_age_group.loc[-1] = [state_code, '15-19', total_persons, total_males, total_females]
    population_age_group.index += 1

    ####################################### Age Group: 20-24 #############################################
    
    # calculate data for this age group using our utility function
    total_persons, total_males, total_females = state_population_data_for_age_range(state_data, [x for x in range(20, 25)])

    # append data for this age group to dataframe
    population_age_group.loc[-1] = [state_code, '20-24', total_persons, total_males, total_females]
    population_age_group.index += 1

    ####################################### Age Group: 25-29 #############################################
    
    # calculate data for this age group using our utility function
    total_persons, total_males, total_females = state_population_data_for_age_range(state_data, [x for x in range(25, 30)])

    # append data for this age group to dataframe
    population_age_group.loc[-1] = [state_code, '25-29', total_persons, total_males, total_females]
    population_age_group.index += 1

    ####################################### Age Group: 30-49 #############################################
    
    # calculate data for this age group using our utility function
    total_persons, total_males, total_females = state_population_data_for_age_range(state_data, [x for x in range(30, 50)])

    # append data for this age group to dataframe
    population_age_group.loc[-1] = [state_code, '30-49', total_persons, total_males, total_females]
    population_age_group.index += 1

    ####################################### Age Group: 50-69 #############################################
    
    # calculate data for this age group using our utility function
    total_persons, total_males, total_females = state_population_data_for_age_range(state_data, [x for x in range(50, 70)])

    # append data for this age group to dataframe
    population_age_group.loc[-1] = [state_code, '50-69', total_persons, total_males, total_females]
    population_age_group.index += 1

    ####################################### Age Group: 70+ #############################################
    
    # calculate data for this age group using our utility function
    total_persons, total_males, total_females = state_population_data_for_age_range(state_data, [x for x in range(70, 100)])

    # also add the data for 100+ persons in this age group
    total_persons += state_data[state_data['Age'] == '100+']['Total - Persons'].values[0]
    total_males += state_data[state_data['Age'] == '100+']['Total - Males'].values[0]
    total_females += state_data[state_data['Age'] == '100+']['Total - Females'].values[0]

    # append data for this age group to dataframe
    population_age_group.loc[-1] = [state_code, '70+', total_persons, total_males, total_females]
    population_age_group.index += 1

    ####################################### Age Group: Age not stated #############################################

    # read the data for persons whose age is not stated
    total_persons = state_data[state_data['Age'] == 'Age not stated']['Total - Persons'].values[0]
    total_males = state_data[state_data['Age'] == 'Age not stated']['Total - Males'].values[0]
    total_females = state_data[state_data['Age'] == 'Age not stated']['Total - Females'].values[0]

    # append data for this age group to dataframe
    population_age_group.loc[-1] = [state_code, 'Age not stated', total_persons, total_males, total_females]
    population_age_group.index += 1
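
The per-group blocks above all follow one pattern, so the grouping can also be expressed as a label mapping followed by a `groupby`. The following is a minimal sketch on toy data, not the notebook's method; `AGE_BANDS`, `age_to_group`, and the toy values are hypothetical names and numbers introduced for illustration.

```python
import pandas as pd

# (first age, last age, label) for the bounded age groups used in the language table
AGE_BANDS = [
    (5, 9, '5-9'), (10, 14, '10-14'), (15, 19, '15-19'), (20, 24, '20-24'),
    (25, 29, '25-29'), (30, 49, '30-49'), (50, 69, '50-69'),
]


def age_to_group(age):
    """Map a single-year age label such as '7' or '100+' to its age-group label."""
    if age == 'Age not stated':
        return age
    if age == '100+':
        return '70+'
    if not age.isdigit():
        return None  # e.g. 'All ages'
    years = int(age)
    for first, last, label in AGE_BANDS:
        if first <= years <= last:
            return label
    return '70+' if years >= 70 else None  # ages 0-4 have no group


# Toy single-year data for one state (illustrative numbers only)
toy = pd.DataFrame({
    'State_Code': ['00'] * 8,
    'Age': ['All ages', '3', '5', '9', '10', '70', '100+', 'Age not stated'],
    'Total - Persons': [1000, 7, 10, 20, 30, 40, 5, 2],
})

toy['Age_Group'] = toy['Age'].map(age_to_group)
grouped = (
    toy.dropna(subset=['Age_Group'])
       .groupby(['State_Code', 'Age_Group'], sort=False)['Total - Persons']
       .sum()
       .reset_index()
)
print(grouped)
```

Rows that map to no group ('All ages' and ages below 5) are dropped before summing, which mirrors the fact that the loop above never reads them.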

# 7. Save the cleaned dataset for future use

In [13]:
population_age_group.to_csv('./dataset/C13-clean.csv', index=False)
population_age_group.head()

Unnamed: 0,State_Code,Age_Group,Total - Persons,Total - Males,Total - Females
323,0,5-9,126928126,66300466,60627660
322,0,10-14,132709212,69418835,63290377
321,0,15-19,120526449,63982396,56544053
320,0,20-24,111424222,57584693,53839529
319,0,25-29,101413965,51344208,50069757


# 8. Prepare the output file age-india.csv

We will calculate the following value for each state and union territory:
- Age group having the highest percentage of people speaking three languages or more
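
As a worked check of this percentage, take the all-India rows for the 20-24 age group: the language table shown earlier lists 12834334 third-language speakers, and the grouped population table lists 111424222 persons. The variable names in this sketch are illustrative.

```python
# Values taken from the tables shown earlier (State_Code 0, age group 20-24)
third_language_speakers = 12834334
total_population = 111424222

percentage = third_language_speakers / total_population * 100
print(round(percentage, 6))
```

The result is about 11.52 percent, which agrees with the value reported for state code 0 in the final output.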

In [14]:
# Prepare the output file to store the data for India, its states, and union territories
age_india = pd.DataFrame(columns=['state/ut', 'age-group', 'percentage'])

for state_code in state_codes_from_language_data_by_age:

    # extract the state's total population data by age group
    total_state_population_by_age_group = population_age_group[population_age_group['State_Code'] == state_code]
    
    # extract the total language data for the state
    language_data_for_state_total = language_data_by_age[(language_data_by_age['State_Code'] == state_code)
                                                        &
                                                        (language_data_by_age['TRU'] == 'Total')]

    # calculate the percentage of people speaking three languages or more for each age group in this state
    # and find the age-group having the max percentage
    
    max_age_group = None
    max_perc_age_group = -1
    for age_group in age_groups_from_language_data_by_age:

        # find the population speaking three languages or more in this age group
        three_lang_or_more_age_group = language_data_for_state_total[language_data_for_state_total['Age_Group'] == age_group]['Number Speaking Third Language - Persons'].values[0]
        
        # find the total state population of this age group
        total_population_for_age_group = total_state_population_by_age_group[total_state_population_by_age_group['Age_Group'] == age_group]['Total - Persons'].values[0]
        
        # calculate percentage
        perc = (three_lang_or_more_age_group / total_population_for_age_group) * 100

        # update max variables
        if perc > max_perc_age_group:
            max_age_group = age_group
            max_perc_age_group = perc

    # append data to dataframe
    age_india.loc[-1] = [state_code, max_age_group, max_perc_age_group]
    age_india.index += 1

# write the data to a CSV file
age_india = age_india.sort_values('state/ut')  # sort values by first column
age_india.to_csv('./output/age-india.csv', index=False)
age_india.head()

Unnamed: 0,state/ut,age-group,percentage
35,0,20-24,11.518442
34,1,20-24,29.580167
33,2,20-24,8.206521
32,3,15-19,45.763949
31,4,50-69,42.774903
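
The nested loops above can also be written as a merge followed by a per-state `idxmax`. This is a minimal sketch on toy inputs rather than the notebook's own method; the toy numbers are illustrative and the frames are assumed to be already restricted to `TRU == 'Total'` with the 'Total' age group removed.

```python
import pandas as pd

# Toy inputs (illustrative numbers only)
language = pd.DataFrame({
    'State_Code': ['00', '00', '01', '01'],
    'Age_Group': ['5-9', '20-24', '5-9', '20-24'],
    'Number Speaking Third Language - Persons': [10, 40, 30, 20],
})
population = pd.DataFrame({
    'State_Code': ['00', '00', '01', '01'],
    'Age_Group': ['5-9', '20-24', '5-9', '20-24'],
    'Total - Persons': [100, 200, 100, 100],
})

# Pair each language row with the population of the same state and age group
merged = language.merge(population, on=['State_Code', 'Age_Group'])
merged['percentage'] = (
    merged['Number Speaking Third Language - Persons'] / merged['Total - Persons'] * 100
)

# Keep, for every state, the row with the highest percentage
best = merged.loc[merged.groupby('State_Code')['percentage'].idxmax()]
result = (
    best[['State_Code', 'Age_Group', 'percentage']]
    .rename(columns={'State_Code': 'state/ut', 'Age_Group': 'age-group'})
    .reset_index(drop=True)
)
print(result)
```

On ties, `idxmax` returns the first maximal row, which matches the strict `>` comparison in the loop above.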


--------------------------------------------------------------------------------- END of Q5 ---------------------------------------------------------------------------------------------