# Project 3: Does Access to the Internet Improve Gender Parity?

Access to the internet is an intriguing lens through which to explore gender equality in education. While the internet is often lauded for its potential to democratize information and provide learning opportunities, its impact on female education access remains understudied.

Using publicly available data on female internet usage and female school enrollment rates, I conducted a preliminary analysis to investigate whether greater internet access among women correlates with improved educational outcomes.

I wanted to know: Does increased internet access among females promote higher school enrollment rates, potentially advancing gender parity in education? My hypothesis is that greater access to the internet provides educational resources and reduces traditional barriers, thereby encouraging female participation in formal education.

## Section 1: Prep

### Step 1: Import packages

In [785]:
import plotly.io as pio
pio.renderers.default = "vscode+jupyterlab+notebook_connected"

import plotly.express as px

import pandas as pd

import numpy as np

### Step 2: Read csv files

In [786]:
file_pathx = '/Users/wanqi/Desktop/Columbia/24 Fall/Computing in Context/Project 3/data/world_femint.csv'
file_pathy = '/Users/wanqi/Desktop/Columbia/24 Fall/Computing in Context/Project 3/data/world_enrol.csv'

f_web = pd.read_csv(file_pathx)
f_school=pd.read_csv(file_pathy)

f_web.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 266 entries, 0 to 265
Data columns (total 68 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Country Name   266 non-null    object
 1   Country Code   266 non-null    object
 2   Series Name    266 non-null    object
 3   Series Code    266 non-null    object
 4   1960 [YR1960]  266 non-null    object
 5   1961 [YR1961]  266 non-null    object
 6   1962 [YR1962]  266 non-null    object
 7   1963 [YR1963]  266 non-null    object
 8   1964 [YR1964]  266 non-null    object
 9   1965 [YR1965]  266 non-null    object
 10  1966 [YR1966]  266 non-null    object
 11  1967 [YR1967]  266 non-null    object
 12  1968 [YR1968]  266 non-null    object
 13  1969 [YR1969]  266 non-null    object
 14  1970 [YR1970]  266 non-null    object
 15  1971 [YR1971]  266 non-null    object
 16  1972 [YR1972]  266 non-null    object
 17  1973 [YR1973]  266 non-null    object
 18  1974 [YR1974]  266 non-null   

### Step 3: Prompt
- **Dataset(s) to be used:**
  - School enrollment, primary, female (% gross) (SE.PRM.ENRR.FE): https://databank.worldbank.org/metadataglossary/2/series/SE.PRM.ENRR.FE
  - Individuals using the Internet, female (% of female population) (IT.NET.USER.FE.ZS)
  - GDP per capita (constant 2015 US$) (NY.GDP.PCAP.KD): https://databank.worldbank.org/metadataglossary/sustainable-development-goals-%28sdgs%29/series/NY.GDP.PCAP.KD

- **Analysis question:** Does increased access to the internet among females positively correlate with higher female school enrollment rates, thereby promoting gender parity in education?

- **Columns that will (likely) be used:**
  - [Country Name]
  - [Country Code]
  - [All Year Columns]

- **Columns to be used to merge/join them:**
  - [f_web] [Country Name, Country Code, Year]
  - [f_school] [Country Name, Country Code, Year]

- **Hypothesis**: Increased internet access among females is positively correlated with higher female school enrollment rates, as greater access to online resources and information promotes educational opportunities and reduces barriers to gender parity in education.

- **Site URL:** [the `*.readthedocs.io` URL of your live site, from the Publish section]

## Section 2: Data Cleaning

### Step 1: Clean f_web data

In [787]:
# Reshape from wide to long using melt

femint = pd.melt(
    f_web,
    id_vars=['Country Name', 'Country Code','Series Name','Series Code'],  # Keep these columns as identifiers
    var_name='Year',  # New column for year
    value_name='Female Internet Access (%)'  # New column for internet access values
)

femint = femint.drop(columns=['Series Name', 'Series Code'])
femint

Unnamed: 0,Country Name,Country Code,Year,Female Internet Access (%)
0,Afghanistan,AFG,1960 [YR1960],..
1,Albania,ALB,1960 [YR1960],..
2,Algeria,DZA,1960 [YR1960],..
3,American Samoa,ASM,1960 [YR1960],..
4,Andorra,AND,1960 [YR1960],..
...,...,...,...,...
17019,Sub-Saharan Africa,SSF,2023 [YR2023],..
17020,Sub-Saharan Africa (excluding high income),SSA,2023 [YR2023],..
17021,Sub-Saharan Africa (IDA & IBRD countries),TSS,2023 [YR2023],..
17022,Upper middle income,UMC,2023 [YR2023],..


In [788]:
# Extract the four-digit year from labels such as "1960 [YR1960]" (raw string keeps \d intact)
femint['Year'] = femint['Year'].str.extract(r'(\d{4})')[0].astype(int)
femint.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17024 entries, 0 to 17023
Data columns (total 4 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   Country Name                17024 non-null  object
 1   Country Code                17024 non-null  object
 2   Year                        17024 non-null  int64 
 3   Female Internet Access (%)  17024 non-null  object
dtypes: int64(1), object(3)
memory usage: 532.1+ KB

In [789]:
# Replace ".." with NaN
femint['Female Internet Access (%)'] = femint['Female Internet Access (%)'].replace("..", np.nan)

# Drop rows with NaN values in the 'Female Internet Access (%)' column
femint = femint.dropna(subset=['Female Internet Access (%)'])

# Display the cleaned DataFrame
femint

Unnamed: 0,Country Name,Country Code,Year,Female Internet Access (%)
10644,Andorra,AND,2000,26.9841
11183,Austria,AUT,2002,31.8275
11225,Denmark,DNK,2002,59.8367
11239,Finland,FIN,2002,60.5637
11245,Germany,DEU,2002,43.7461
...,...,...,...,...
16956,Turkiye,TUR,2023,82.1321
16962,United Arab Emirates,ARE,2023,100
16966,Uzbekistan,UZB,2023,87.1292
16969,Viet Nam,VNM,2023,75.8894
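
The same melt, year-extraction, and missing-value steps recur for each World Bank file in this notebook, so they could be folded into one helper. The following is only a sketch: the function name `tidy_worldbank` and the toy frame are hypothetical, and it swaps the `replace("..", np.nan)` step for `pd.to_numeric(..., errors='coerce')`, which also converts the values to floats.

```python
import pandas as pd

def tidy_worldbank(wide: pd.DataFrame, value_name: str) -> pd.DataFrame:
    """Hypothetical helper: reshape a wide World Bank export into tidy rows."""
    id_cols = ['Country Name', 'Country Code', 'Series Name', 'Series Code']
    tidy_df = wide.melt(id_vars=id_cols, var_name='Year', value_name=value_name)
    tidy_df = tidy_df.drop(columns=['Series Name', 'Series Code'])
    # Raw string so that \d reaches the regex engine unchanged
    tidy_df['Year'] = tidy_df['Year'].str.extract(r'(\d{4})')[0].astype(int)
    # ".." is the export's missing-value marker; coerce turns it into NaN
    tidy_df[value_name] = pd.to_numeric(tidy_df[value_name], errors='coerce')
    return tidy_df.dropna(subset=[value_name]).reset_index(drop=True)

# Toy input that mimics the layout of the real export (series labels are placeholders)
toy = pd.DataFrame({
    'Country Name': ['Austria', 'Denmark'],
    'Country Code': ['AUT', 'DNK'],
    'Series Name': ['demo series', 'demo series'],
    'Series Code': ['DEMO', 'DEMO'],
    '2001 [YR2001]': ['..', '..'],
    '2002 [YR2002]': ['31.8275', '59.8367'],
})

tidy = tidy_worldbank(toy, 'Female Internet Access (%)')
print(tidy)
```

With a helper like this, each dataset would need a single call, and the value column would already be numeric before any merge.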


### Step 2: Clean f_school data

In [790]:
# Reshape from wide to long using melt

femsch = pd.melt(
    f_school,
    id_vars=['Country Name', 'Country Code','Series Name','Series Code'],  # Keep these columns as identifiers
    var_name='Year',  # New column for year
    value_name='Female Primary School Enrolment (%)'  # New column for enrolment values
)

femsch = femsch.drop(columns=['Series Name', 'Series Code'])
femsch

Unnamed: 0,Country Name,Country Code,Year,Female Primary School Enrolment (%)
0,Afghanistan,AFG,1960 [YR1960],..
1,Africa Eastern and Southern,AFE,1960 [YR1960],..
2,Africa Western and Central,AFW,1960 [YR1960],..
3,Albania,ALB,1960 [YR1960],..
4,Algeria,DZA,1960 [YR1960],..
...,...,...,...,...
17019,West Bank and Gaza,PSE,2023 [YR2023],..
17020,World,WLD,2023 [YR2023],..
17021,"Yemen, Rep.",YEM,2023 [YR2023],..
17022,Zambia,ZMB,2023 [YR2023],..


In [791]:
# Extract the four-digit year from labels such as "1960 [YR1960]" (raw string keeps \d intact)
femsch['Year'] = femsch['Year'].str.extract(r'(\d{4})')[0].astype(int)
femsch.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17024 entries, 0 to 17023
Data columns (total 4 columns):
 #   Column                               Non-Null Count  Dtype 
---  ------                               --------------  ----- 
 0   Country Name                         17024 non-null  object
 1   Country Code                         17024 non-null  object
 2   Year                                 17024 non-null  int64 
 3   Female Primary School Enrolment (%)  17024 non-null  object
dtypes: int64(1), object(3)
memory usage: 532.1+ KB

In [792]:
# Replace ".." with NaN
femsch['Female Primary School Enrolment (%)'] = femsch['Female Primary School Enrolment (%)'].replace("..", np.nan)

# Drop rows with NaN values in the 'Female Primary School Enrolment (%)' column
femsch = femsch.dropna(subset=['Female Primary School Enrolment (%)'])

femsch

Unnamed: 0,Country Name,Country Code,Year,Female Primary School Enrolment (%)
2660,Afghanistan,AFG,1970,9.534939766
2661,Africa Eastern and Southern,AFE,1970,51.47764969
2662,Africa Western and Central,AFW,1970,32.86861038
2669,Arab World,ARB,1970,52.1231308
2670,Argentina,ARG,1970,106.1132965
...,...,...,...,...
16951,Peru,PER,2023,107.5719986
16976,Somalia,SOM,2023,19.23847961
16994,Syrian Arab Republic,SYR,2023,79.65898132
16997,Thailand,THA,2023,98.21199799


### Step 3: Merge Data and Visualize

In [793]:
# Merge the two DataFrames on 'Country Name', 'Country Code', and 'Year'
merged_data = pd.merge(
    femint,
    femsch,
    on=['Country Name', 'Country Code', 'Year'],
    how='inner'  # Use inner join to keep only matching rows
)

merged_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1228 entries, 0 to 1227
Data columns (total 5 columns):
 #   Column                               Non-Null Count  Dtype 
---  ------                               --------------  ----- 
 0   Country Name                         1228 non-null   object
 1   Country Code                         1228 non-null   object
 2   Year                                 1228 non-null   int64 
 3   Female Internet Access (%)           1228 non-null   object
 4   Female Primary School Enrolment (%)  1228 non-null   object
dtypes: int64(1), object(4)
memory usage: 48.1+ KB
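
An inner join silently drops every country-year that appears in only one table, so it can help to audit how many rows match before relying on the result. The sketch below uses made-up toy frames (the names `left`, `right`, and `audit` and all values are illustrative); `indicator=True` and `validate='one_to_one'` are standard `pd.merge` options.

```python
import pandas as pd

# Toy stand-ins for the two cleaned tables; values are illustrative only
left = pd.DataFrame({
    'Country Code': ['AUT', 'AUT', 'DNK'],
    'Year': [2002, 2003, 2002],
    'internet': [31.8, 36.0, 59.8],
})
right = pd.DataFrame({
    'Country Code': ['AUT', 'DNK', 'FIN'],
    'Year': [2002, 2002, 2002],
    'enrolment': [102.2, 101.7, 99.3],
})

# An outer join with indicator=True labels each row as both / left_only / right_only;
# validate='one_to_one' raises if a (country, year) key is duplicated on either side
audit = pd.merge(
    left,
    right,
    on=['Country Code', 'Year'],
    how='outer',
    indicator=True,
    validate='one_to_one',
)
match_counts = audit['_merge'].value_counts()
print(match_counts)
```

Rows labelled `both` are the ones an inner join would keep; the other two labels show what is lost from each side.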


In [794]:
# Ensure the columns have no NaN values before conversion
merged_data.dropna(subset=['Female Internet Access (%)', 'Female Primary School Enrolment (%)'], inplace=True)

# Convert the columns to float
merged_data['Female Internet Access (%)'] = merged_data['Female Internet Access (%)'].astype(float)
merged_data['Female Primary School Enrolment (%)'] = merged_data['Female Primary School Enrolment (%)'].astype(float)

merged_data

Unnamed: 0,Country Name,Country Code,Year,Female Internet Access (%),Female Primary School Enrolment (%)
0,Austria,AUT,2002,31.8275,102.157829
1,Denmark,DNK,2002,59.8367,101.688759
2,Finland,FIN,2002,60.5637,99.311958
3,Germany,DEU,2002,43.7461,104.565582
4,Greece,GRC,2002,11.7320,102.311241
...,...,...,...,...,...
1223,Georgia,GEO,2023,82.0899,103.306000
1224,Kazakhstan,KAZ,2023,91.7601,100.651001
1225,Paraguay,PRY,2023,79.2152,89.920998
1226,Thailand,THA,2023,88.6092,98.211998


In [795]:
fig1 = px.scatter(
    merged_data,
    x='Female Internet Access (%)',
    y='Female Primary School Enrolment (%)',
    color='Country Name', 
    title='Relationship Between Female Internet Access and School Enrollment',
    hover_name='Country Name',
)

fig1.show()


In [796]:
# Same scatter plot with an OLS trendline (plotly's trendline='ols' requires statsmodels)

fig2 = px.scatter(
    merged_data,
    x='Female Internet Access (%)',
    y='Female Primary School Enrolment (%)', 
    title='Relationship Between Female Internet Access and School Enrollment',
    hover_name='Country Name',
    trendline='ols',
    trendline_color_override='red'
)

fig2.show()

The trendline is surprising: there doesn't seem to be a positive correlation at all! I suspect this has to do with differences in economic development between countries and regions.
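
The impression from the trendline can be backed by a number. A minimal sketch, assuming a frame with the same two column names as `merged_data`: `Series.corr` gives the Pearson coefficient and `np.polyfit` gives the slope of a least-squares line. The toy values are made up and say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Made-up values, only to show the calculation
toy = pd.DataFrame({
    'Female Internet Access (%)': [10.0, 30.0, 50.0, 70.0, 90.0],
    'Female Primary School Enrolment (%)': [101.0, 99.0, 102.0, 98.0, 100.0],
})

x = toy['Female Internet Access (%)']
y = toy['Female Primary School Enrolment (%)']

# Pearson correlation coefficient (ranges from -1 to 1)
r = x.corr(y)

# Degree-1 polynomial fit: slope and intercept of the least-squares line
slope, intercept = np.polyfit(x, y, deg=1)

print(f"Pearson r = {r:.3f}, slope = {slope:.4f}, intercept = {intercept:.2f}")
```

A coefficient near zero would agree with a flat trendline; the sign of the slope always matches the sign of the coefficient.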

## Extra Research: Regional Comparison with Economic Development in Mind

My second hypothesis is that regions with higher levels of economic development exhibit a stronger positive correlation between female internet access and school enrollment. These countries are more likely to have the infrastructure, policies, and social norms that support both digital inclusion and gender parity in education. For instance, widespread internet access in these regions often complements existing educational systems, enabling students, especially females, to use online resources for learning.

In contrast, in less economically developed regions, the relationship may be weaker or even absent due to systemic barriers. Limited infrastructure, affordability issues, and socio-cultural constraints may prevent the internet from effectively improving female educational outcomes. Furthermore, in some cases female internet access might not yet be widespread enough to make a meaningful impact on education statistics; that is, the absolute level of usage is still low.

### Step 1: Read, reshape, and clean the new GDP data

In [797]:
file_pathz='/Users/wanqi/Desktop/Columbia/24 Fall/Computing in Context/Project 3/data/gdp.csv'

gdp=pd.read_csv(file_pathz)
gdp.head()


Unnamed: 0,Series Name,Series Code,Country Name,Country Code,1960 [YR1960],1961 [YR1961],1962 [YR1962],1963 [YR1963],1964 [YR1964],1965 [YR1965],...,2014 [YR2014],2015 [YR2015],2016 [YR2016],2017 [YR2017],2018 [YR2018],2019 [YR2019],2020 [YR2020],2021 [YR2021],2022 [YR2022],2023 [YR2023]
0,GDP per capita (constant 2015 US$),NY.GDP.PCAP.KD,Afghanistan,AFG,..,..,..,..,..,..,...,576.4878173,566.8811297,564.9208406,563.4882365,553.973306,559.140954,529.1449097,407.6165052,372.6158948,..
1,GDP per capita (constant 2015 US$),NY.GDP.PCAP.KD,Albania,ALB,..,..,..,..,..,..,...,3855.760744,3952.803574,4090.372728,4249.820049,4431.555595,4543.38771,4418.660874,4857.111942,5155.29086,5394.18241
2,GDP per capita (constant 2015 US$),NY.GDP.PCAP.KD,Algeria,DZA,2394.837448,2032.565159,1607.554288,2124.677213,2210.036534,2308.886774,...,4687.288575,4741.49977,4829.185829,4806.631255,4782.034707,4737.129774,4422.979472,4515.573999,4602.575608,4717.399329
3,GDP per capita (constant 2015 US$),NY.GDP.PCAP.KD,American Samoa,ASM,..,..,..,..,..,..,...,12494.98021,13101.54182,13116.43098,12442.85767,13049.33015,13288.35656,14214.64646,14464.81405,14969.05994,..
4,GDP per capita (constant 2015 US$),NY.GDP.PCAP.KD,Andorra,AND,..,..,..,..,..,..,...,38402.64926,38885.53032,39886.64008,39321.61431,39320.09245,39413.79088,34394.41404,36616.10076,39720.95186,40161.76953


In [798]:
# Reshape from wide to long using melt

world_gdp = pd.melt(
    gdp,
    id_vars=['Country Name', 'Country Code','Series Name','Series Code'],  # Keep these columns as identifiers
    var_name='Year',  # New column for year
    value_name='GDP per Capita'  # New column for GDP per capita values
)

world_gdp = world_gdp.drop(columns=['Series Name', 'Series Code'])
world_gdp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17024 entries, 0 to 17023
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Country Name    17024 non-null  object
 1   Country Code    17024 non-null  object
 2   Year            17024 non-null  object
 3   GDP per Capita  17024 non-null  object
dtypes: object(4)
memory usage: 532.1+ KB


In [799]:
# Format the Year column and convert GDP per capita to numeric (".." becomes NaN)
world_gdp['Year'] = world_gdp['Year'].str.extract(r'(\d{4})')[0].astype(int)
world_gdp['GDP per Capita'] = pd.to_numeric(world_gdp['GDP per Capita'], errors='coerce')
world_gdp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17024 entries, 0 to 17023
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Country Name    17024 non-null  object 
 1   Country Code    17024 non-null  object 
 2   Year            17024 non-null  int64  
 3   GDP per Capita  13963 non-null  float64
dtypes: float64(1), int64(1), object(2)
memory usage: 532.1+ KB

In [800]:
# ".." was already coerced to NaN by pd.to_numeric above, so only the drop is needed
world_gdp = world_gdp.dropna(subset=['GDP per Capita'])

world_gdp

Unnamed: 0,Country Name,Country Code,Year,GDP per Capita
2,Algeria,DZA,1960,2394.837448
7,Argentina,ARG,1960,7410.305029
10,Australia,AUS,1960,19904.943410
11,Austria,AUT,1960,11960.957510
13,"Bahamas, The",BHS,1960,19087.682250
...,...,...,...,...
17019,Sub-Saharan Africa,SSF,2023,1604.265968
17020,Sub-Saharan Africa (excluding high income),SSA,2023,1602.808470
17021,Sub-Saharan Africa (IDA & IBRD countries),TSS,2023,1604.265968
17022,Upper middle income,UMC,2023,9801.905948


### Step 2: Merge world_gdp with merged_data and visualize

In [801]:
final_data = pd.merge(
    merged_data,
    world_gdp,
    on=['Country Name', 'Country Code','Year'],  # Common columns to merge on
    how='inner'  # Keep only matching rows
)

final_data

Unnamed: 0,Country Name,Country Code,Year,Female Internet Access (%),Female Primary School Enrolment (%),GDP per Capita
0,Austria,AUT,2002,31.8275,102.157829,39636.482610
1,Denmark,DNK,2002,59.8367,101.688759,49541.855550
2,Finland,FIN,2002,60.5637,99.311958,39351.390400
3,Germany,DEU,2002,43.7461,104.565582,34883.058070
4,Greece,GRC,2002,11.7320,102.311241,19996.465100
...,...,...,...,...,...,...
1220,Georgia,GEO,2023,82.0899,103.306000,6086.592688
1221,Kazakhstan,KAZ,2023,91.7601,100.651001,11700.836480
1222,Paraguay,PRY,2023,79.2152,89.920998,6415.429285
1223,Thailand,THA,2023,88.6092,98.211998,6384.807181


In [802]:
fig3 = px.scatter(
    final_data,
    x='Female Internet Access (%)',
    y='Female Primary School Enrolment (%)',
    color='GDP per Capita',  # Color points by GDP
    size='GDP per Capita',   # Optionally vary point sizes by GDP
    title='Impact of GDP on Internet Access and School Enrollment',
    labels={
        'Female Internet Access (%)': 'Female Internet Access (%)',
        'Female Primary School Enrolment (%)': 'Female Primary School Enrolment (%)',
        'GDP per Capita': 'GDP per Capita (USD)'
    },
    hover_name='Country Name',  # Show country names on hover
)

fig3.show()

The results are still a little unclear, but my hypothesis doesn't seem to hold: the yellow (high-GDP) circles don't trend toward the upper right of the plot.

### Step 3: Visualize again based on Income Level groups

In [803]:
# Roughly define income groups based on World Bank classifications of GNI per capita.
# GDP per capita here is in constant 2015 US$, so the current-dollar cut-offs are
# deflated by a rough factor: 1.33 = 1 + an assumed 33% price increase since 2015.
adj_threshold = 1.33

def categorize_gdp(gdp_per_capita):
    if gdp_per_capita >= 14000 / adj_threshold:  # High income cut-off (approximate)
        return 'High Income'
    elif gdp_per_capita >= 4500 / adj_threshold:  # Upper middle income cut-off (approximate)
        return 'Upper Middle Income'
    else:
        return 'Lower Middle to Low Income'

# Create a new column for GDP categories
final_data['Income Level'] = final_data['GDP per Capita'].apply(categorize_gdp)
final_data

Unnamed: 0,Country Name,Country Code,Year,Female Internet Access (%),Female Primary School Enrolment (%),GDP per Capita,Income Level
0,Austria,AUT,2002,31.8275,102.157829,39636.482610,High Income
1,Denmark,DNK,2002,59.8367,101.688759,49541.855550,High Income
2,Finland,FIN,2002,60.5637,99.311958,39351.390400,High Income
3,Germany,DEU,2002,43.7461,104.565582,34883.058070,High Income
4,Greece,GRC,2002,11.7320,102.311241,19996.465100,High Income
...,...,...,...,...,...,...,...
1220,Georgia,GEO,2023,82.0899,103.306000,6086.592688,Upper Middle Income
1221,Kazakhstan,KAZ,2023,91.7601,100.651001,11700.836480,High Income
1222,Paraguay,PRY,2023,79.2152,89.920998,6415.429285,Upper Middle Income
1223,Thailand,THA,2023,88.6092,98.211998,6384.807181,Upper Middle Income
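
Binning a numeric column into labelled groups can also be done with `pd.cut`, which avoids a row-by-row `apply`. This is only a sketch: the cut-offs 4,000 and 12,000 are round placeholder numbers chosen for illustration, not World Bank thresholds, and the GDP values are toy inputs.

```python
import numpy as np
import pandas as pd

# Placeholder cut-offs in constant 2015 US$ (illustrative, NOT official thresholds)
bins = [0, 4_000, 12_000, np.inf]
labels = ['Lower Middle to Low Income', 'Upper Middle Income', 'High Income']

toy = pd.DataFrame({'GDP per Capita': [372.6, 6086.6, 11700.8, 39636.5]})

# right=False makes each bin include its lower edge: [0, 4000), [4000, 12000), [12000, inf)
toy['Income Level'] = pd.cut(
    toy['GDP per Capita'],
    bins=bins,
    labels=labels,
    right=False,
)
print(toy)
```

Keeping the cut-offs in one list makes it easy to see and adjust which values were assumed.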


In [804]:
fig4 = px.scatter(
    final_data,
    x='Female Internet Access (%)',
    y='Female Primary School Enrolment (%)',
    color='Income Level',  
    trendline='ols',  
    title='Relationship Between Internet Access and Enrollment by Income Group',
    labels={
        'Female Internet Access (%)': 'Female Internet Access (%)',
        'Female Primary School Enrolment (%)': 'Female Primary School Enrolment (%)'
    },
    hover_name='Country Name'
)
fig4.show()
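
To put a number on each income group's trend rather than reading it off the fitted lines, a per-group Pearson correlation could be computed. A minimal sketch, assuming a frame with the same column names as `final_data`; the toy values are made up purely to exercise the code and do not describe the real dataset.

```python
import pandas as pd

x_col = 'Female Internet Access (%)'
y_col = 'Female Primary School Enrolment (%)'

# Made-up rows for two groups, four observations each
toy = pd.DataFrame({
    'Income Level': ['High Income'] * 4 + ['Upper Middle Income'] * 4,
    x_col: [60.0, 70.0, 80.0, 90.0, 20.0, 40.0, 60.0, 80.0],
    y_col: [101.0, 100.0, 100.0, 99.0, 90.0, 94.0, 97.0, 101.0],
})

# One Pearson coefficient per income group
group_r = {
    level: group[x_col].corr(group[y_col])
    for level, group in toy.groupby('Income Level')
}

for level, r in group_r.items():
    print(f"{level}: r = {r:.3f} (n = {(toy['Income Level'] == level).sum()})")
```

Printing the group size next to each coefficient is a reminder that correlations from very small groups are unstable.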

## Conclusion

In fig4, the higher-income groups show weaker or even negative correlations between female internet access and school enrollment, while the opposite is true for lower-income countries! Of course, correlation does not imply causation, and a more rigorous analysis would use multivariable regression.

A possible explanation is a saturation effect. In high-income countries, primary school enrollment rates are already near universal, leaving little room for improvement regardless of internet access. Additionally, internet usage in high-income nations often serves non-educational purposes, and disparities within these countries (e.g., rural vs. urban divides) may limit its impact on education. In contrast, middle-income countries may show stronger correlations, as internet access plays a critical role in expanding education systems.

Meanwhile, low-income countries face structural barriers, like limited infrastructure, which may prevent internet access from significantly improving enrollment. Furthermore, primary school enrollment might not fully capture the impact of internet access, as metrics like secondary education or literacy rates could offer clearer insights. 

These findings suggest a more complex relationship between economic development, internet access, and education, warranting further exploration. 
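
The multivariable regression mentioned above could start from something like the sketch below, which regresses enrollment on internet access while controlling for log GDP per capita. Everything here is an assumption for illustration: the data are synthetic and built from known coefficients so the fit can be checked, and `np.linalg.lstsq` stands in for a full statistics package that would also report standard errors.

```python
import numpy as np

# Synthetic predictors (not real data)
internet = np.array([10.0, 30.0, 50.0, 70.0, 90.0, 40.0])
gdp = np.array([1_000.0, 10_000.0, 1_000.0, 100_000.0, 10_000.0, 100_000.0])
log_gdp = np.log10(gdp)  # GDP is usually logged because it is heavily right-skewed

# Synthetic outcome generated from known coefficients: 80 + 0.05*internet + 3*log_gdp
enrol = 80.0 + 0.05 * internet + 3.0 * log_gdp

# Design matrix: intercept column, internet access, log GDP per capita
X = np.column_stack([np.ones_like(internet), internet, log_gdp])

# Ordinary least squares via the least-squares solver
coef, *_ = np.linalg.lstsq(X, enrol, rcond=None)
intercept, b_internet, b_log_gdp = coef

print(f"intercept = {intercept:.2f}, internet = {b_internet:.3f}, log GDP = {b_log_gdp:.2f}")
```

In a fit like this, `b_internet` is the association between internet access and enrollment with GDP held fixed, which is the quantity the income-group comparison only approximates.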