<h1 style="text-align:center;">Covid-19 Impact On Digital Learning</h1>
<div style="text-align:center"><img src="https://images.unsplash.com/photo-1610484826917-0f101a7bf7f4?ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&ixlib=rb-1.2.1&auto=format&fit=crop&w=1050&q=80" width="580" height="280" /><figcaption><a href="https://unsplash.com/photos/Y8TiLvKnLeg">Source: Unsplash</a></figcaption></div>

# Preface
The Covid-19 pandemic not only disrupted the global economy and threatened people's health but also widened the digital divide in learning, worldwide and in the United States in particular. Why do I say "widened"? Even before the pandemic, there was a significant digital divide between students with and without access to high-speed internet and appropriate learning devices such as tablets and computers. There is also a gap between high-income and low-income households, and other factors such as geography and race/ethnicity play a role. A new analysis by Common Sense and BCG finds that the gap has worsened and is larger than previously understood. These inequities motivated me to take part in the [LearnPlatform](https://www.kaggle.com/c/learnplatform-covid19-impact-on-digital-learning/) competition, to find insights that can inform policy suggestions or creative ideas for narrowing the gap and moving toward equity in distance learning.

My report contains the following sections:
 *  **Introduction**
 *  **Summarize Key Findings & Recommend Policies**
 *  **Data Source & Data Structure**
 *  **EDA & Storytelling Analytics**
 *  **Conclusion**

Thank you and happy reading!

Nam Duong 


# Introduction

## Background
Before the pandemic, the education sector focused more on face-to-face learning than on distance learning. Then suddenly, everything changed. The spread of a novel contagious virus forced governments in more than 130 countries to close educational institutions, affecting more than one billion students. In nearly all of the US, schools moved more than 56 million students to full-time distance learning from home. To limit classroom disruption, schools and teachers have tried to reach students remotely through distance learning tools and digital platforms.

The expanding use of technology for both homework and distance learning exposed a crisis of inequity in access to digital education, affecting students and teachers alike, known as the "digital divide." While 99% of public schools nationwide have high-speed broadband access, the same is not true of households, so students face many challenges when learning from home. One study projects that learning disruption caused by a lack of internet access and learning devices leads to increased learning loss among America's most vulnerable learners. Closing the digital divide is therefore more critical than ever, and it will bring long-term benefits when other crises occur in the future.

## Main Points
To support a better understanding of digital learning during the Covid-19 pandemic and of the term "digital divide," I will walk you through the following main points:

- First, I describe the picture of digital learning connectivity and engagement in 2020, how it may evolve in the future, how the types of education technology changed over the year, etc.
- Second, I focus on the digital distance learning divide at the student-segment and state levels to explore what the term means. I then look at the root causes of the divide, infrastructure conditions, access challenges, and differences across student segments and states.
- Third, I characterize the problem by demographics (race/ethnicity, income, etc.) and geography (rural,..., urban).
- Fourth, I turn to the technical requirements for learning from home, identifying the technological specifications for connectivity and devices as well as the non-technological factors that affect successful distance learning.
- Fifth, I discuss how to close the digital divide in the US.



# Summarize Key Findings & Recommend Policies

## Key Findings

*Digital Learning Connectivity And Engagement In 2020*
* Most states across the US moved students to full-time distance learning from home.
* During the Covid-19 pandemic, Edtech start-up funding rose considerably, as did interest in Edtech products.
* Online learning has advantages and is likely to keep evolving, especially when other crises occur.
* Edtech products target PreK-12 students the most, followed by high-school or GDE...
* The primary function of a product is a vital factor in choosing educational products.
* Google LLC is still an Edtech giant, with various educational products used for many purposes, such as Google Docs, Google Meet, and Google Classroom. Other products such as Zoom, Canvas, YouTube, and Classlink also attract many students.
* The percentage of access and the engagement index differ significantly by state.
* A product that is popular in one state is not necessarily popular in all states. Depending on factors such as demographics and local characteristics, products are chosen and used differently.
* Other district characteristics, such as the share of students eligible for free/reduced lunch, the Black/Hispanic percentage, and the internet access ratio, differ by locale, state, race, etc.

*Digital Divide Matter*
* The digital divide existed before the pandemic. Still, the gap among students grew considerably during the pandemic, raising the risk of dropout and loss of learning progress.
* The digital divide affects many fields, not only education but also the economy, living conditions, and society, so it is an urgent matter.
* The benefits of closing the digital divide are an incentive for us, policymakers, and the government to take action.
* There are three leading root causes of the digital divide: lack of availability, lack of affordability, and lack of adoption. To close the digital divide, we must address all three.
* The digital divide separates students with internet access and adequate devices for learning from home from those without.
* Other problems should be considered, such as the teacher gap in technology and noticeable trends in 2020.

*Digital Divide In Different Demographics*
* Students in low-income households are less likely to have internet access or appropriate devices for distance learning.
* Students who live in rural areas tend to face poorer conditions than those in other areas, so their rate of internet or device access is also the lowest.
* The rate of internet access among White and Asian children is higher than average. In contrast, the rate for Black and Hispanic children is near the average, and the rate for American Indian and Alaska Native children is below it.

## Recommend Policies (Any creative and practical solutions are welcome)

* Promoting programs that widen the internet network so that everyone can access it.
* Researching breakthrough technologies in internet connectivity.
* Launching policies that attract telecommunications companies to invest in internet services & infrastructure.
* Decreasing the cost of the internet.
* Holding meetings among telecommunications companies to discuss customer care and preferential purchase policies.
* Directing customer benefit policies toward the most vulnerable.
* Allowing for transparency in pricing and encouraging bulk-purchasing efforts of devices & computers by states and districts.
* Raising awareness of low-cost broadband service offerings and broadband service cost-support programs.
* Governors must commit to creating the most favorable conditions for maintaining the necessary supply chain in the short term to close the gap during Covid-19.
* Government funds should be invested more in the chip manufacturing industry, since it is the base that other manufacturers rely upon.
* Addressing parents' potential lack of trust in, and skepticism of, technology solutions.


In [None]:
# Import related libraries 
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import os, glob
import plotly.express as px
from plotly.offline import plot, iplot, init_notebook_mode
init_notebook_mode(connected=True)
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Data Source & Data Structure

## Data Source
- Data provided by host: learnplatform-covid19-impact-data
- External Data: National Telecommunications and Information Administration, US Census Bureau, US Department of Education, Pew Research Center, BroadBandNow, American Community Survey (ACS), USAfacts, etc.

## Data Structure
- For more information about the LearnPlatform data, please see: ../input/learnplatform-covid19-impact-on-digital-learning/README.md
- External Data: In this report, I embed charts and maps provided by the data owners in some places, because some raw data sheets are not available and to save runtime.

# EDA & Detailed Analysis

Let's take a quick look and get familiar with the dataset we will work with!

In [None]:
# Load products_info data file
product_info = pd.read_csv(r'/kaggle/input/learnplatform-covid19-impact-on-digital-learning/products_info.csv')
product_info.head()


In [None]:
print('The shape of product_info data:', product_info.shape) # check the rows and columns of this dataset
print('---------')
print('Data type: ')
print(product_info.dtypes)   # check datatypes to see if there are some wrongly categorized type
print('---------')
print('Null data: ')
print(product_info.isnull().sum())   # check each columns number of null values
print('---------')
print('The number of unique in data: ')
print(product_info.nunique()) # check each columns number of unique values 


In [None]:
# Load districts_info data file
districts_info = pd.read_csv(r'/kaggle/input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv')
districts_info.head()

In [None]:
print('The shape of districts_info data:', districts_info.shape) # check the rows and columns of this dataset
print('---------')
print('Data type: ')
print(districts_info.dtypes)   # check datatypes to see if there are some wrongly categorized type
print('---------')
print('Null data: ')
print(districts_info.isnull().sum())   # check each columns number of null values
print('---------')
print('The number of unique in data: ')
print(districts_info.nunique()) # check each columns number of unique values 

In [None]:
# I concatenate the files in the engagement folder into one file for convenient handling
#path = r"C:\Users\ADMIN\digitallearning_Kaggle\engagement_data"
#all_files = glob.glob(os.path.join(path, "*.csv"))
#df_from_each_file = []   # create empty list to store df 
#for f in all_files:
#   df = pd.read_csv(f, sep=',')
#   df['district_id'] = f[-8:-4]
#   df_from_each_file.append(df)

#df_merged = pd.concat(df_from_each_file, ignore_index=True)  # concat multiple files into one
#df_merged.to_csv("merged.csv")  # I added this file to the input for convenience and to save runtime
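
The merge step described in the commented-out cell above can be sketched as follows. This is a minimal sketch, not the exact code used to build `merged.csv`: the helper name `merge_engagement_files` is hypothetical, the sample rows are made up, and it assumes each engagement file is named after its numeric district ID (as the slice `f[-8:-4]` suggests). Taking the ID from the file stem avoids relying on a fixed four-character ID.

```python
from pathlib import Path
import tempfile

import pandas as pd


def merge_engagement_files(folder):
    """Concatenate every CSV in `folder`, tagging rows with the file stem as district_id."""
    frames = []
    for csv_path in sorted(Path(folder).glob("*.csv")):
        df = pd.read_csv(csv_path)
        # Assumption: files are named after a numeric district ID, e.g. 1000.csv
        df["district_id"] = int(csv_path.stem)
        frames.append(df)
    return pd.concat(frames, ignore_index=True)


# Tiny demonstration with two made-up district files
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"time": ["2020-01-01"], "lp_id": [1.0],
                  "pct_access": [0.5], "engagement_index": [12.0]}
                 ).to_csv(Path(tmp) / "1000.csv", index=False)
    pd.DataFrame({"time": ["2020-01-01", "2020-01-02"], "lp_id": [1.0, 2.0],
                  "pct_access": [0.1, 0.0], "engagement_index": [3.0, None]}
                 ).to_csv(Path(tmp) / "2000.csv", index=False)
    merged = merge_engagement_files(tmp)

print(merged)
```

Saving the result with `to_csv(..., index=False)` would avoid the extra `Unnamed: 0` column that otherwise has to be dropped after loading.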


In [None]:
# Load file merged engagement
engagement_merged = pd.read_csv('/kaggle/input/merged/merged.csv')
engagement_merged.drop(columns=['Unnamed: 0'], inplace=True)
engagement_merged.head()

In [None]:
print('The shape of engagement_merged data:', engagement_merged.shape) # check the rows and columns of this dataset
print('---------')
print('Data type: ')
print(engagement_merged.dtypes)   # check datatypes to see if there are some wrongly categorized type
print('---------')
print('Null data: ')
print(engagement_merged.isnull().sum())   # check each columns number of null values
print('---------')
print('The number of unique in data: ')
print(engagement_merged.nunique()) # check each columns number of unique values 
 


In [None]:
product_info[product_info['Provider/Company Name'].isnull()]
product_info[product_info['Sector(s)'].isnull()]
# The rows with NaN may still be useful for analysis: their products are quite popular, but their sector cannot be labeled clearly. So I keep them and replace NaN in the Sector(s) and Primary Essential Function columns with 'flexible'.
# Additionally, the URL column is not needed here, so I remove it for convenience.

In [None]:
product_info.fillna({'Sector(s)':'flexible','Primary Essential Function':'flexible','Provider/Company Name':'unknown'},inplace=True) # fill NaN values
product_info.drop(columns=['URL'],inplace=True) # drop unnecessary column
product_info.rename(columns={'LP ID':'lp_id','Product Name':'product_name'
                    ,'Provider/Company Name':'provider/company_name'
                    ,'Sector(s)':'sector','Primary Essential Function':'primary_essential_function'}, inplace=True) # rename columns for convenient handling
product_info.isnull().sum().any()  # check for nulls again

In [None]:
engagement_merged = engagement_merged[engagement_merged['pct_access'].notna()] # drop NaN in pct_access column
engagement_merged = engagement_merged[engagement_merged['lp_id'].notna()] # drop NaN in lp_id column
# count rows where engagement_index is NaN but pct_access is not 0; NaN only appears together with pct_access of 0.0, so I fill it with 0
(engagement_merged[engagement_merged['engagement_index'].isnull()]['pct_access'] != 0).sum()
engagement_merged.fillna(0,inplace=True)
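
The cleaning logic above rests on one check: rows with a missing `engagement_index` should all have `pct_access` equal to 0 before the NaN values are filled with 0. A minimal sketch of that check on made-up data (the values and the name `toy` are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "lp_id": [10.0, 11.0, np.nan, 12.0, 13.0],
    "pct_access": [0.50, 0.00, 0.20, np.nan, 0.00],
    "engagement_index": [25.0, np.nan, 4.0, 1.0, np.nan],
})

# Drop rows that cannot be tied to a product or have no access figure
toy = toy.dropna(subset=["pct_access", "lp_id"])

# Count rows where engagement_index is missing although pct_access is non-zero
missing_engagement = toy["engagement_index"].isna()
conflicts = int((missing_engagement & (toy["pct_access"] != 0)).sum())
print("Rows with missing engagement but non-zero access:", conflicts)

# Only fill with 0 when the check above finds no conflicts
if conflicts == 0:
    toy = toy.fillna({"engagement_index": 0})
print(toy)
```

Filling only the `engagement_index` column, rather than the whole frame, keeps any other missing values visible for later inspection.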


To better understand what follows, I encourage you to look at the infographics below for a brief overview of how states responded to Covid-19 and of the US academic calendar for 2019-2020. If you are already familiar with these, skip ahead to the following sections!
<div style="text-align:center"><img src="https://www.kff.org/wp-content/uploads/2020/04/WEB2-Stay-at-Home-Orders-by-State_1-2.png" width="709" height="380" /></div>
<div style="text-align:center"><img src="https://www.calendarpedia.com/images-large/school/2019-2020-school-calendar-l.png" width="680" height="580" /></div>


Education in the 21st century relies strongly on technology, so more and more technology companies and start-ups worldwide are building educational platforms that support distance learning, especially during the Covid-19 pandemic. The charts below give a general picture of the pandemic's impact on students' education, the change in Edtech capital raised in 2020, and the education sectors that products support.

In [None]:
%%HTML
<iframe width="750" height="520" frameborder="0" src="https://usafacts.org/articles/65-of-childrens-education-has-moved-online-during-covid-19/embed/0/?"></iframe>
<iframe width="750" height="520" frameborder="0" src="https://usafacts.org/embed/chart/?autosize=false&axisTextColor=%23616161&chartType=1&margins=%7B%22top%22%3A0%7D&metrics=%5B%7B%22id%22%3A%22sotu2151%22%2C%22allStates%22%3Atrue%7D%5D&selectableYears=false&sortRows=false&source=SOTU&title=Percentage%20of%20households%20with%20children%20reporting%20use%20of%20online%20distance%20learning%3A"></iframe>

Across the US, 65% of households reported that their children's classes moved to distance learning using online resources, while 15% of households transitioned to distance learning with paper materials sent home.\
Due to the pandemic, 26% of households reported that classes were canceled at some point during the school year. In 11% of cases, parents or guardians responded that the "pandemic did not affect how children in this household received an education."\
Schooling during the pandemic also varies by state. 84-86% of households with children in Washington, Arizona, and New Mexico are learning remotely. In Wyoming and South Dakota, 25-28% are learning online. The map shows that most states located on the east and west coast have distance learning rates higher than states in the middle of the US.


In [None]:
%%HTML
<div style="text-align:center"><img src="https://edsurge.imgix.net/uploads/photo/image/8548/2020-1610502497.jpg" width="600" height="500" /></div>

During the Covid-19 pandemic, there has been a dramatic increase in the number of students studying online in the U.S. since March 2020. This most disruptive year to schools and society proved lucrative for the education industry, particularly for those raising private capital. In 2020, U.S. education technology startups raised over \\$2.2 billion in venture and private equity capital across 130 deals, according to the EdSurge ed-tech funding database. That's a nearly 30 percent increase from the \\$1.7 billion invested in 2019, which was spread across 105 deals, according to [EdSurge](https://www.edsurge.com/news/2021-01-13-a-record-year-amid-a-pandemic-us-edtech-raises-2-2-billion-in-2020).
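
The "nearly 30 percent" figure follows directly from the two totals quoted above. Here is a quick check in Python. The totals are rounded ("over 2.2 billion"), so the result is approximate, and the average deal sizes are my own derived illustration, not numbers reported by EdSurge.

```python
# Funding totals (in dollars) and deal counts quoted above from EdSurge
raised_2019, deals_2019 = 1.7e9, 105
raised_2020, deals_2020 = 2.2e9, 130

# Year-over-year growth in capital raised
growth_pct = (raised_2020 - raised_2019) / raised_2019 * 100
print(f"Growth in capital raised: {growth_pct:.1f}%")

# Derived average deal size per year, in millions of dollars
avg_2019 = raised_2019 / deals_2019 / 1e6
avg_2020 = raised_2020 / deals_2020 / 1e6
print(f"Average deal size: {avg_2019:.1f}M (2019) vs {avg_2020:.1f}M (2020)")
```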

The \\$2.2 billion marks the highest investment total in a single year for the U.S. ed-tech industry. It is not an anomaly: a [report from C.B. Insights](https://www.cbinsights.com/reports/CB-Insights_MoneyTree-Q4-2020-Headline.pdf) showed that investments in all venture-backed U.S. companies reached a record \\$130 billion in 2020, up 14 percent from 2019.


**The Future: Physical vs. Online**

A few datasets from the U.S. Department of Education, however, reveal a shift to online education. According to the National Center for Education Statistics, 21% of public schools offered at least one course entirely online during the 2017-2018 school year. At that time, only 5.7% provided a majority of all classes online.\
In a 2019 report, 85% of district administrators said using digital learning resources was a high priority, and 79% said they provide a range of these programs. The 2020-2021 school year, at least during the pandemic, changed how most children learn. Government data in the coming years could show whether that impact is temporary or longer-lasting.\
Either way, Covid-19 has altered the landscape of education. Even large universities are adopting e-learning and, in turn, saving on investment in physical infrastructure. This is also making education more flexible, accessible, and affordable. Because crises like the Covid-19 pandemic may occur again, online learning is arguably the best solution, and it will work even better if we solve the "digital divide" problem. I discuss this in more detail later. For now, let us mine some insights from the data collected below.



In [None]:
# plot the popular sector of education where product is used
plt.figure(figsize = (15,8))
ax = sns.countplot(data = product_info,  y = 'sector', color='grey')
ax.set_ylabel('Sector(s)')
ax.set_xlabel('Frequencies')
ax.set_title('The Popular Sector Of Education Where Product Is Used')
# set orange color for the longest bar
# the bars are horizontal (y='sector'), so the count is the bar width, not its height
patch_w = [patch.get_width() for patch in ax.patches]
# np.argmax returns the index of the largest value in the list
idx_longest = np.argmax(patch_w)
ax.patches[idx_longest].set_facecolor('orange')

We see that most educational products target PreK-12 students. "PreK-12" refers to the range of publicly supported primary and secondary education in the United States, from pre-kindergarten to 12th grade. It is understandable why most products are used in the PreK-12 sector: according to the National Center for Education Statistics (NCES), about 56.4 million students were projected to attend elementary, middle, and high schools across the United States in 2020, compared with about 19.7 million projected to attend colleges and universities (Higher Ed). Additionally, PreK-12 is an important period in a student's learning path; students need many educational tools that support their growth, make subjects more approachable and intuitive, save time, and encourage self-learning, which traditional educational approaches do not easily provide. Next, let us take a closer look at the charts below for more information.

In [None]:
# plot the top providers/companies with the widest variety of products
plt.figure(figsize=(12,8))
# name the columns explicitly so the code does not depend on the pandas version
top_providers = product_info['provider/company_name'].value_counts().head(15).rename_axis('provider').reset_index(name='n_products')
ax = sns.barplot(data=top_providers, x='n_products', y='provider')
ax.set_xlabel('The Number Of Products')
ax.set_ylabel('Provider/Company')
ax.set_title('Top Providers/Companies By Variety Of Products')

In [None]:
# plot common primary essential function of product
plt.figure(figsize=(15,8))
ax = sns.countplot(data=product_info, y = 'primary_essential_function')
ax.set_title('Common Primary Essential Function Of Product')
ax.set_xlabel('Frequencies')
ax.set_ylabel('Primary Essential Function')

In [None]:
for x in product_info['sector'].unique()[0:3]:
    plt.figure(figsize=(12,8))  # create a new figure per sector so every chart has the same size
    ax = sns.countplot(data = product_info[product_info['sector'] == x], y='primary_essential_function')
    ax.set_title(str(x))
    ax.set_xlabel('Frequencies')
    ax.set_ylabel('Primary Essential Function')
    plt.show()

Google LLC is the company with the widest variety of products. Google LLC is an American multinational technology company, founded in 1998, that specializes in Internet-related services and products. According to the New York Times article "How Google Took Over the Classroom," more than half the nation's primary- and secondary-school students (more than 30 million children) use Google education apps, the company says. This helps explain why Google has the most products meeting the needs of both students and teachers, which opens a new educational ecosystem for Google LLC. Although Microsoft, along with companies such as Adobe Inc. and Learning A-Z, has fallen behind Google in product variety, it still has popular signature products such as Microsoft 365 Education, OneNote, and Teams.

The primary essential functions that products provide differ among levels of education. Overall, the digital learning platform (also called a digital education platform, or DEP for short) is the most common essential function. As the name suggests, a DEP is an online "environment" comprising applications and tools for the education sector. It is used by teachers, administrative staff, and students, and is explicitly designed to accelerate digital teaching and learning for schools. In short, these platforms are collections of software and tools that work together for education organizations, professionals, and the communities they serve. Microsoft Office 365 Education and Google's G Suite for Education are good examples of a DEP. Other popular essential functions include Sites-Resources & Reference, Games & Simulation, Study Tools, and Courseware & Textbooks, which are a great addition for students' self-study. In PreK-12, the DEP is the most popular function, followed by Sites-Resources & Reference, Games & Simulation, and so on. In higher education, the most popular function is not the DEP but Study Tools. A likely explanation is that students in higher education, where subjects are studied at an advanced level, rely on self-study and therefore need more study tools to support their homework, research, or personal projects than students at lower levels. At the corporate level, the most popular essential function is Content Creation & Curation. We can conclude that the most popular primary essential function depends on the goal of each educational level or corporate aim.
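
The comparison of essential functions across sectors can be tabulated directly. Below is a minimal sketch on made-up rows: the product rows are illustrative, and the `CODE - Name` label format is an assumption based on how the notebook later splits labels on `'- '`.

```python
import pandas as pd

toy_products = pd.DataFrame({
    "sector": ["PreK-12", "PreK-12", "PreK-12", "Higher Ed", "Higher Ed", "Corporate"],
    "primary_essential_function": [
        "LC - Digital Learning Platforms",
        "LC - Digital Learning Platforms",
        "LC - Sites, Resources & Reference",
        "LC - Study Tools",
        "LC - Study Tools",
        "LC - Content Creation & Curation",
    ],
})

# Keep only the descriptive part after the category code
toy_products["function_name"] = (
    toy_products["primary_essential_function"].str.split("- ", n=1).str[-1]
)

# Rows: function, columns: sector, values: number of products
counts = pd.crosstab(toy_products["function_name"], toy_products["sector"])
print(counts)

# Most common function within each sector
most_common = counts.idxmax()
print(most_common)
```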

In [None]:
# save rows with NaN values to a separate data frame, which may be used later
district_info_anonymous = districts_info[districts_info['state'].isna() | districts_info['locale'].isna() | districts_info['pct_black/hispanic'].isna() 
                                         |districts_info['pct_free/reduced'].isna() | districts_info['county_connections_ratio'].isna() | districts_info['pp_total_raw'].isna()]
# save complete data frame 
districts_info_complete = districts_info.dropna()

In [None]:
df_copy = engagement_merged.copy()
product_merge_available = pd.merge(df_copy, product_info[['lp_id', 'product_name']], on = 'lp_id')
# function return number of top product based on method such as mean or median of pct_access or engagement_index and plot in word cloud to show popular product
def top_product_available(interest, method_wanted, amount): 
    top_product = product_merge_available.groupby(['product_name'])[str(interest)].agg(method_wanted).sort_values(ascending=False).head(amount).reset_index()

    primary_func = []
    for name in list(top_product['product_name'].values):
        d = product_info[product_info['product_name']==name]['primary_essential_function'].values[0]
        d = d.split('- ',1)[-1]
        primary_func.append(d) 

    from wordcloud import WordCloud
    fig = plt.figure(figsize=(15,10))
    fig.add_subplot(121)
    wordcloud = WordCloud (
                        background_color = 'black',
                        width = 812,
                        height = 684
                            ).generate(' '.join(list(top_product['product_name'].values)))
    plt.imshow(wordcloud) # image show
    plt.axis('off') # to off the axis of x and y

    fig.add_subplot(122)
    wordcloud = WordCloud (
                        background_color = 'black',
                        width = 812,
                        height = 684
                            ).generate(' '.join(primary_func))
    plt.imshow(wordcloud) # image show
    plt.axis('off') # to off the axis of x and y
    return list(top_product['product_name'].values), plt.show()


In [None]:
list_product = top_product_available('pct_access', np.mean, 10) # plot popular products; list_product[0] holds the top 10 products most accessed by students


The left chart shows the top 10 products with the highest average percentage of access by students. Many of them are familiar, such as YouTube, Drive, and Google Classroom. As mentioned above, most of the products in the top 10 belong to Google LLC, which demonstrates how large its ecosystem is and supports its position as a leader in educational technology. The right chart shows the primary essential functions provided by these products. The functions are very diverse, which should change the impression that digital learning or educational technology only includes tools with tedious functions: these products offer many useful functions for students, teachers, and even corporations. Next, to gain deeper knowledge, we will examine variables such as percent access, engagement index, and page loads per student.
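
The ranking behind the left chart boils down to a group-by, a mean, and a sort. Here is a minimal sketch on made-up engagement rows (the values are illustrative only, not taken from the dataset):

```python
import pandas as pd

toy_engagement = pd.DataFrame({
    "product_name": ["Google Docs", "Google Docs", "Zoom", "Zoom", "Canvas", "Canvas"],
    "pct_access": [20.0, 30.0, 10.0, 14.0, 5.0, 7.0],
})

# Average pct_access per product, highest first, keep the top 2
top_products = (
    toy_engagement.groupby("product_name")["pct_access"]
    .mean()
    .sort_values(ascending=False)
    .head(2)
)
print(top_products)
```

Swapping `.mean()` for `.median()` makes the ranking less sensitive to a few days of unusually heavy use.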

In [None]:
# function plot pct_access or engagement_index over time in engagement_merged dataframe (any products include products don't exist in product_info dataframe)
def access_overtime_any_product(interest, method_wanted):  # interest is pct_access or engagement_index & method is np.mean or np.median
    #Using plotly.express
    import plotly.express as px
    overtime = pd.DataFrame(engagement_merged.groupby(['time'])[str(interest)].agg(method_wanted)).reset_index()
    fig = px.line(overtime, x='time', y=str(interest), title=str(interest)+' Any Products Including Products Not Recorded In The US Over Time')
    return fig.show()
# function plot pct_access or engagement_index over time in product_merged_available dataframe (include products exist or recorded in product_info dataframe)
def access_overtime_available_product(interest, list_product_considered, method_wanted):  # interest is pct_access or engagement_index & method is np.mean or np.median
    temp = product_merge_available[product_merge_available['product_name'].isin(list_product_considered)]
    #Using plotly.express
    import plotly.express as px
    overtime = pd.DataFrame(temp.groupby(['time'])[str(interest)].agg(method_wanted)).reset_index()
    fig = px.line(overtime, x='time', y=str(interest), title=str(interest)+' Of '+str(len(list_product_considered))+' Chosen Products In The US Over Time')
    return fig.show()


In [None]:
# plot pct_access and engagement_index over time for all products, including products not recorded in the product_info dataframe
access_overtime_any_product('pct_access', np.mean)
access_overtime_any_product('engagement_index', np.mean)
# plot pct_access and engagement_index over time for the top 10 most popular products
access_overtime_available_product('pct_access', list_product[0], np.mean)
access_overtime_available_product('engagement_index', list_product[0], np.mean)

The four line charts show how the percentage of access and the engagement index changed over time. All four have a similar shape: in the first half of 2020 (January to May) and the last part of the year (September to December), both measures fluctuate widely, meaning that the difference between the highest and lowest values is much greater than in the period from May to August. So what happened? One reasonable hypothesis is based on school terms. The school year in the US usually runs from early September until May or June (nine months) and is divided into terms. Most schools use a semester system of two sessions: fall (September to December) and spring (January to May). Some schools use the quarter system, which comprises three sessions: fall (September to December), winter (January to March), and spring (March to May or June). We can infer that students are on summer break from May to August, so both percent access and engagement with educational products decreased substantially during this time. After the summer vacation, the American school year traditionally begins at the end of August or early in September. In 2020, daily coronavirus cases and deaths kept rising in the following months (see the chart below), so face-to-face learning was impossible and learning from home was the solution. Consequently, the percentage of access and the engagement index increased again. Note that this explanation may not be entirely correct because of confounding factors, but it is good enough to explain where these differences come from, and it is a good starting point for further hypotheses.
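
The "fluctuation" described above is the range (highest minus lowest daily average) within each period. Below is a minimal sketch of that calculation. The daily values are synthetic, and the period boundaries (January to May, June to August, September to December) are my own simplification for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic daily averages for 2020: wider swings in term time, flat in summer
days = pd.date_range("2020-01-01", "2020-12-31", freq="D")
rng = np.random.default_rng(0)
in_summer = (days.month >= 6) & (days.month <= 8)
values = np.where(in_summer,
                  rng.uniform(0.1, 0.2, len(days)),
                  rng.uniform(0.2, 1.0, len(days)))
daily = pd.DataFrame({"time": days, "pct_access": values})


def period_of(month):
    """Map a month number to an (assumed) school period label."""
    if month <= 5:
        return "spring term (Jan-May)"
    if month <= 8:
        return "summer break (Jun-Aug)"
    return "fall term (Sep-Dec)"


daily["period"] = daily["time"].dt.month.map(period_of)

# Range = max - min of the daily average within each period
summary = daily.groupby("period")["pct_access"].agg(["min", "max"])
summary["range"] = summary["max"] - summary["min"]
print(summary)
```

Applied to the real daily averages, the same table would show whether the term-time range really exceeds the summer range.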

In [None]:
%%HTML
<div class='tableauPlaceholder' id='viz1629188702726' style='position: relative'><noscript><a href='https:&#47;&#47;covidtracking.com&#47;'><img alt='0_All Key Metrics ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;CT&#47;CTPWebsiteGallery&#47;0_AllKeyMetrics&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='CTPWebsiteGallery&#47;0_AllKeyMetrics' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;CT&#47;CTPWebsiteGallery&#47;0_AllKeyMetrics&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='language' value='en' /><param name='filter' value='publish=yes' /><param name='origin' value='viz_share_link' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1629188702726');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='1000px';vizElement.style.height='627px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

In [None]:
engagement_merged['page_load_per_student'] = engagement_merged['engagement_index'] / 1000 # add columns page load per student
result = pd.merge(engagement_merged, districts_info[['district_id','state','locale','pct_black/hispanic','pct_free/reduced','county_connections_ratio','pp_total_raw']]
                 , on="district_id")

In [None]:
# function display table summarize pct_acces or engagement_index per state or locale
def display_table_summarize(interest, state_or_locale):
    summarize = result.groupby(str(state_or_locale))[str(interest)].describe()
    # assign cmap
    cm = sns.light_palette("green", as_cmap=True)
    # display summarize pct_acces per state
    if (str(interest) == 'pct_access') & (str(state_or_locale) == 'state'):
        sm = summarize.style.set_table_attributes("style='display:inline'").set_caption('Summarize Pct_Access Per State').background_gradient(cmap=cm)
    # display summarize engagement_index per state
    elif (str(interest) == 'engagement_index') & (str(state_or_locale) == 'state'):
        sm = summarize.style.set_table_attributes("style='display:inline'").set_caption('Summarize Engagement_Index Per State').background_gradient(cmap=cm)
    # display summarize pct_access per locale
    elif (str(interest) == 'pct_access') & (str(state_or_locale) == 'locale'):
        sm = summarize.style.set_table_attributes("style='display:inline'").set_caption('Summarize Pct_Access Per Locale').background_gradient(cmap=cm)
    # display summarize engagement_index per locale
    elif (str(interest) == 'engagement_index') & (str(state_or_locale) == 'locale'):
        sm = summarize.style.set_table_attributes("style='display:inline'").set_caption('Summarize Engagement_Index Per Locale').background_gradient(cmap=cm)
    else:
        return 'Interest variable is not available!! Please retype one of the following: pct_access or engagement_index & state or locale'
    return display(sm)

For more detail, the tables below show descriptive statistics for two variables, percentage of access and engagement index, by state and by locale. We look at essential statistics such as the mean and standard deviation.\
In the state-level analysis, the four states with the highest average percentage of access are North Dakota, Arizona, New York, and New Hampshire, in that order, but the standard deviation in those states is also high because of extreme values. The max values for each state are much larger than the min, median, or mean. Looking at Q3 (75%), 75% of the recorded daily values in 2020 are below 0.8% in most states, except North Dakota. The same interpretation applies to the engagement index table. 

In [None]:
display_table_summarize('pct_access', 'state')
display_table_summarize('engagement_index', 'state')


Did the differences in percent of access between states occur by chance? Hypothesis testing is a reasonable way to check.
Because the percent-of-access data are skewed by extreme values, we use the Kruskal-Wallis test, which compares ranks and does not assume normality. For reference, please see the paper [here](http://webspace.ship.edu/pgmarr/Geo441/Readings/Kruskal%20and%20Wallis%201952%20-%20Use%20of%20Ranks%20in%20One-Criterion%20Variance%20Analysis.pdf).

> **Null Hypothesis (H0):** There is no difference in the percent of access by state.\
**Alternative Hypothesis (H1):** There is a significant difference in the percent of access by state.

The results below show that the p-value is very small, so we reject the null hypothesis (H0) in favor of the alternative hypothesis (H1). **The difference in percent of access by state is therefore statistically significant and unlikely to have occurred by chance.** The same holds for the engagement index.
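
To make the ranking idea behind the Kruskal-Wallis test concrete, here is a minimal pure-Python sketch on made-up numbers (not the competition data). The helper name `kruskal_wallis_h` is hypothetical, it omits the tie correction that `scipy.stats.kruskal` applies, and the shortcut `exp(-H/2)` for the p-value is only valid for three groups (chi-square approximation with 2 degrees of freedom).

```python
import math

def kruskal_wallis_h(groups):
    """Return the Kruskal-Wallis H statistic (no tie correction) for a list of samples."""
    pooled = sorted(value for group in groups for value in group)
    # give every distinct value the average of the ranks it occupies (ranks start at 1)
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    n_total = len(pooled)
    # sum over groups of (rank sum)^2 / group size
    rank_term = sum(sum(rank_of[v] for v in group) ** 2 / len(group) for group in groups)
    return 12 / (n_total * (n_total + 1)) * rank_term - 3 * (n_total + 1)

# hypothetical daily pct_access samples for three states (illustrative only)
state_a = [0.2, 0.3, 0.5, 0.4]
state_b = [0.6, 0.8, 0.7, 0.9]
state_c = [1.1, 1.5, 1.2, 1.0]

h_stat = kruskal_wallis_h([state_a, state_b, state_c])
# with 3 groups the chi-square survival function (2 degrees of freedom) is exp(-H/2)
p_value = math.exp(-h_stat / 2)
print("H = {0:.3f}, p-value = {1:.4f}".format(h_stat, p_value))  # H is about 9.85, p about 0.007
```

The rank sums here are 10, 26, and 42, so the three samples barely overlap and the small p-value leads to rejecting H0, which is the same reasoning applied to the real data.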

In [None]:
# function kruskal test
from scipy import stats

def kruskal_test(access_or_engagement):
    # one sample of the chosen metric per state (rows without a state are skipped by groupby)
    groups = [group[str(access_or_engagement)] for _, group in result.groupby('state')]
    # unpack the samples instead of typing every state by hand
    H, p = stats.kruskal(*groups)
    print("Test of {0}, p-value: {1}".format(access_or_engagement, p))

In [None]:
kruskal_test('pct_access')
kruskal_test('engagement_index')

In [None]:
plt.figure(figsize=(12,8))
ax = sns.barplot(data = pd.DataFrame(districts_info.groupby(['state'])['district_id'].count().sort_values(ascending=False)).reset_index() , y = 'state', x = 'district_id', color='green')
ax.set_xlabel('The Number Of District Per State')
ax.set_ylabel('State')
ax.set_title('The Number Of School District In Each State')

In the locale-level analysis, the average rural access percentage is far from what we expected. It is higher than that of the other locales. Rural areas are less developed, have low population density, and suffer many disadvantages, yet they have the highest mean percentage of access, which is counterintuitive and raises questions about validity and consistency. One reason may be the distribution of the number of school districts in each locale (shown in the chart below).\
We have: $$Mean = \frac{x_1+x_2+\dots+x_n}{n}$$

Here n is the sample size, which in this case depends on the number of school districts in each locale. The mean is sensitive to extreme values: with a small n, a few very large values of x can pull the mean up and skew the distribution, while with a large n their influence is diluted. The number of school districts in the suburbs is higher than in rural areas, which may contribute to the different mean values. Other, unknown factors may also play a role.
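
As a small illustration of how sample size and extreme values interact in a mean, the sketch below uses made-up access percentages (not the competition data). It only shows the arithmetic and is not evidence about any locale.

```python
from statistics import mean, median

# hypothetical daily pct_access values (illustrative only)
typical_days = [0.1, 0.2, 0.2, 0.3]
small_sample = typical_days + [9.2]   # 5 values, one extreme
large_sample = [0.2] * 49 + [9.2]     # 50 values, the same extreme

mean_typical = mean(typical_days)     # about 0.2
mean_small = mean(small_sample)       # about 2.0, pulled up by the single extreme value
median_small = median(small_sample)   # 0.2, unaffected by the extreme value
mean_large = mean(large_sample)       # about 0.38, the extreme value is diluted

print(mean_typical, mean_small, median_small, mean_large)
```

The same extreme value moves the mean ten-fold in the small sample but less than two-fold in the large one, while the median barely changes, which is why comparing means across groups of very different sizes needs care.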

Note that the percentage of access, described as the percentage of students in the district who have at least one page-load event of a given product on a given day, does not tell the whole story, and there is not enough data to draw firm conclusions from it. In the following sections we therefore consider other meaningful variables to gain more confidence in the conclusions.

In [None]:
display_table_summarize('pct_access', 'locale')
display_table_summarize('engagement_index', 'locale')

In [None]:
# plot distribution the number of school district by locale
district_amount = {} 
for locale in ["City", "Rural", "Suburb", "Town"]:
    district_amount[locale] = result[result['locale'] == locale]["district_id"].nunique()
dis_amount_df = pd.DataFrame.from_dict(district_amount, orient='index', columns=['Count']).reset_index().rename(columns = {'index':'locale'})
plt.figure(figsize=(10,6))
ax = sns.barplot(data=dis_amount_df, x = 'locale', y='Count')
ax.set_title("Distribution of school districts by locale")

Continuing with the picture of digital learning in the US in 2020, we next discuss the change in percentage of access for the top 4 most popular products.

In [None]:
# sort descending product_id with highest pct_access to lowest
max_user_access = result.groupby(['lp_id'])['pct_access'].agg(np.max).sort_values(ascending = False).to_frame().reset_index() 
# we merge column lp_id, pct_access, product_name together to make a new dataframe
product_access = pd.merge(engagement_merged[['time','lp_id', 'pct_access','engagement_index']],product_info[['lp_id', 'product_name']], on = 'lp_id')
product_access = product_access.sort_values(['time'])
product_access = product_access.reset_index().drop(columns = ['index','lp_id'])

In [None]:
# function return a dict {product ID: product name} for the given product names
def id_product_name(list_product_considered):
    most_access = list(list_product_considered)
    name = {}
    for x in most_access:
        # look up the product ID only if the name exists in product_info
        if (product_info['product_name']==x).any():
            n = product_info[product_info['product_name']==x]['lp_id'].values[0]
            name[n] = x
        else:
            return 'Not found'
    return name

# function plot pct_access, engagement_index or page_load_per_student over time given a product in given a district
def plot_interest_by_product(interest, id_product, district_id):
    fig = px.line(result[(result['lp_id']==id_product) & (result['district_id']==district_id)], x='time', y=str(interest), 
                  title='Percentage Of Students In District With At Least 1 Page Load Of '+ str(id_product))
    return fig.show()

# function plot distribution of pct_free/reduced or pp_total_raw in each given pct_black/hispanic
def plot_distribution_based_black_hispanic(interest, hue = 'pct_black/hispanic'):
    plt.figure(figsize=(12,9))
    ax = sns.countplot(data=result.sort_values([str(interest)]), x = str(interest), hue=hue)  # use the hue argument instead of a hard-coded column
    if str(interest) == 'pct_free/reduced':
        ax.set_xlabel('Percentage Of Students Eligible For Free Or Reduced Price Lunch')
        ax.set_ylabel('Frequencies')
        ax.set_title('Distribution Of Percentage Of Free/Reduced Lunch Given A Range Of Percentage of Black/Hispanic')
    elif str(interest) == 'pp_total_raw':
        ax.set_xlabel('Per-Pupil Total Expenditure School-By-School')
        ax.set_ylabel('Frequencies')
        ax.set_title('Distribution Of Per-Pupil Total Expenditure Given A Range Of Percentage of Black/Hispanic')
        ax.set_xticklabels(ax.get_xticklabels(),rotation = 90)
    else:
        return 'Interest variable is not available!! Please retype one of the following: pct_free/reduced or pp_total_raw'
    return plt.show()

## function return the top-ranked products at each time (calculated by mean pct_access or engagement_index)
def max_product_interest(interest, amount): # amount is the number of rank_max expected
    c = [str(x+1)+'st Max' for x in range(amount)]
    pivot_df = product_access.groupby(['time','product_name'])[str(interest)].agg(np.mean).unstack().reset_index()
    df = (pivot_df.set_index('time')
            .apply(lambda x: pd.Series(x.nlargest(amount).index, index=c), axis=1)  # return n product_name has largest mean pct_access over The US in given time
            .reset_index())
    return df

# function plot Frequencies of product ranked by calculating mean of pct_access or engagement_index over month 
def plot_product_month(interest, rank_number_max, amount):
    temp = max_product_interest(interest,amount)
    temp['month'] = pd.DatetimeIndex(temp['time']).month  # add column month
    temp = pd.DataFrame(temp.groupby(['month'])[str(rank_number_max)+'st Max'].value_counts()).rename(columns={str(rank_number_max)+'st Max':'count'}).reset_index()
    fig = px.bar(temp, x="month", y="count", color=str(rank_number_max)+'st Max', title="The Change Of "+str(rank_number_max)+'st Max'+' Product Over Month ('+str(interest)+')')
    return fig.show()

#function plot mean daily page_load_per_student by state and product_name
def mean_daily_pageload_or_access(list_id_product, group_by, interest): # group_by is state or product_name (choose one out of them) 
    temp = result[result['lp_id'].isin(list_id_product)]    
    temp = pd.merge(temp, product_info[['lp_id', 'product_name']], on = 'lp_id')
    temp =  temp.groupby([str(group_by)])[str(interest)].agg(np.mean).to_frame().reset_index().sort_values(by = str(interest),ascending = False)
    plt.figure(figsize=(10,6))
    if str(group_by) == 'product_name':
        ax = sns.barplot(data = temp, y = str(group_by), x = str(interest), color='orange')
        ax.set_xlabel('Mean Daily '+str(interest))
        ax.set_title('Mean Daily '+str(interest) +' In Top '+ str(len(temp)) +' Products In The US')
        ax.set_ylabel('Product Name')
    elif str(group_by) == 'state':
        clrs = [sns.color_palette("Set2")[4] if (x < temp[str(interest)].mean()) else sns.color_palette("Set2")[1] for x in temp[str(interest)]] # set bars chart smaller than mean are green, else orange
        ax = sns.barplot(data = temp, y = 'state', x = str(interest), palette=clrs)
        ax.axvline(temp[str(interest)].mean(), color='b', linestyle='--', label="Mean") # set mean line 
        ax.set_title('Mean Daily  '+ str(interest)+' '+ str(len(list_id_product)) +' Given Products Per Student By State')
        ax.set_xlabel('Mean Daily '+str(interest))
        ax.set_ylabel('State')
    else:
        return 'Not match! Please retype one of the following: product_name or state'
    return plt.show()

In [None]:
max_product_interest('pct_access',4) # example of dataframe return top 4 products over time

In [None]:
# plot 1st Max ranked products over month to observe the change among them and differences when based on calculating pct_access and engagement_index
plot_product_month('pct_access',1,4) 
plot_product_month('engagement_index',1, 4)

For the first-place product, Google Docs had the most student access in the first two months of the year. Google Docs includes an online word processor, spreadsheet, and presentation editor. Students and teachers can use these tools to collaborate on assignments, projects, newsletters, and blogs, among other things. In this way, Google Docs promotes teamwork, which is why it is so prevalent in school districts. According to the report A Cloud-Based Software Engineering Learning Environment, up to 71% of students used Google Docs.\
From April, Google Docs was replaced by Google Classroom. Before explaining that, recall the context in January and February 2020. According to The Washington Post, U.S. intelligence agencies were issuing ominous, classified warnings in January and February about the global danger posed by the coronavirus while President Trump and lawmakers played down the threat and failed to take action that might have slowed the spread of the pathogen, according to U.S. officials familiar with spy agency reporting. At that time, Americans seemed neither to care about nor to understand what was coming. Before schools closed, Google Docs was used daily for the functions it provides, which does not mean it was a critical tool for online learning. Once schools closed in March because of the outbreak and learning moved online, Google Classroom became a more suitable tool than Google Docs alone. Google Classroom allows educators to create a virtual classroom, invite students to attend live instruction, and record students' grades. Its primary purpose is to simplify and streamline file sharing between students and teachers, which is why it replaced Google Docs as the most accessed product. As far as I know, U.S. students do not have to attend classes during summer break, but this is not true for all students. Some students register for summer classes, so in June roughly half of the first-place days go to YouTube for entertainment and the rest to Google Classroom for virtual learning.\
For the engagement index, Google Docs still holds the majority because of the nature of the product. As described, the engagement index is the total number of page-load events per one thousand students of a given product on a given day. It is hard to know exactly what this means, and there are different ways to read it. Suppose it counts every action, such as creating a new page, editing, or loading a new page. In that case, imagine using Google Classroom: you join a virtual class, listen to your professor, and take notes, so you cannot generate as many page loads as you would in Google Docs. Alternatively, if page-load events simply count how often a product is loaded, the distinction matters less. Either way, the engagement index here is just an additional reference, and we consider page load per student, derived from the engagement index, later.\
Beyond Google Classroom and Google Docs, various other products appear as the third and fourth most accessed product over the months, as shown in the charts below for your exploration. 
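
To make the relationship between the two metrics concrete, here is a minimal sketch of the arithmetic on made-up numbers. The first helper mirrors the `page_load_per_student` column computed earlier (engagement index divided by 1000). The second helper, `page_load_per_active_student`, is a hypothetical extra that assumes `pct_access` is expressed in percent (0 to 100) and is not part of the original analysis.

```python
def page_load_per_student(engagement_index):
    """Page loads per enrolled student: the index is defined per 1000 students."""
    return engagement_index / 1000

def page_load_per_active_student(engagement_index, pct_access):
    """Page loads per student who actually used the product that day.

    Assumes pct_access is a percentage between 0 and 100 (an assumption of this sketch).
    Returns None when nobody accessed the product.
    """
    if pct_access == 0:
        return None
    return page_load_per_student(engagement_index) / (pct_access / 100)

# hypothetical record: 2500 page loads per 1000 students, 25% of students active
per_student = page_load_per_student(2500)             # 2.5 page loads per enrolled student
per_active = page_load_per_active_student(2500, 25)   # 10.0 page loads per active student
print(per_student, per_active)
```

Read this way, a product can have a modest percentage of access but a high engagement index when the students who do use it load many pages, which matches the Google Docs pattern described above.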



In [None]:
# 3st Max
plot_product_month('pct_access',3,4)
# 4st Max
plot_product_month('pct_access',4,4)

Looking in more detail, most states with page load per student above the average are on the east coast of the U.S. The same is true for the percentage of access. The difference may be caused by location and demographics, as well as Covid cases, deaths, or restrictive policies in each state. For more information about total Covid cases by state, please visit [United States Coronavirus Outbreak](https://www.bloomberg.com/graphics/2020-united-states-coronavirus-outbreak/).

In [None]:
top_10products_id = id_product_name(list_product[0]) # top 10 products considered
mean_daily_pageload_or_access(top_10products_id,'state', 'page_load_per_student') # page_load_per_student
mean_daily_pageload_or_access(top_10products_id, 'state','pct_access') # pct_access


The charts below show the mean daily percentage of access and page load per student for the top 10 products.

In [None]:
mean_daily_pageload_or_access(top_10products_id, 'product_name', 'page_load_per_student') # page_load_per_student
mean_daily_pageload_or_access(top_10products_id, 'product_name', 'pct_access') # pct_access

In [None]:
# function plot the change or difference of chosen products by all state exist in recorded dataset
def plot_product_pageload_change_by_state(list_id_product):
    temp = result[result['lp_id'].isin(list_id_product)]    
    temp = pd.merge(temp, product_info[['lp_id', 'product_name']], on = 'lp_id')
    temp = temp.groupby(['product_name','state'])['page_load_per_student'].agg(np.mean).to_frame().reset_index().sort_values(by = 'page_load_per_student',ascending = False)
    state_l = list(temp['state'].unique())
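    # dynamic subplot layout: 3 charts per row, enough rows for every state
    num_plots = len(state_l)
    total_cols = 3
    total_rows = max(1, -(-num_plots // total_cols))  # ceiling division avoids an empty extra row
    fig, axs = plt.subplots(nrows=total_rows, ncols=total_cols,
                            figsize=(5*total_cols, 5*total_rows), constrained_layout=True,
                            squeeze=False)  # always a 2D array of axes, even with one row
    for i, state in enumerate(state_l):
        row = i//total_cols
        pos = i % total_cols
        plot = sns.barplot(data=temp[temp['state'] == state], y='product_name', x = 'page_load_per_student', color='lightgreen', ax=axs[row][pos])
        plot.title.set_text(str(state))
    # hide unused axes in the last row
    for ax in axs.flat[num_plots:]:
        ax.set_visible(False)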

In [None]:
plot_product_pageload_change_by_state(top_10products_id) # plot top 10 product most popular

Google Docs remains the top product by daily page loads in most states, except Michigan and Texas. In Michigan, students may be more interested in Meet than Google Classroom, while in Texas they are more interested in YouTube. In general, the top position does not differ much by state, but the remaining positions differ considerably due to characteristics, demographics, education, etc. The United States is a multiracial country, so differences in how students use products daily, or in which products interest them, are to be expected.
If you are curious about how the percentage of access in the top 10 products changes over the weeks, let us look at the following charts.

In [None]:
# function plot distribution of interest such as pct_access, engagement_index or page load per student by state over week with optional list id products
def plot_interest_by_state_over_week(interest, list_id_product, method_wanted, list_2_state_highlighted): # method is np.mean or np.median
        temp = result[result['lp_id'].isin(list_id_product)]    
        temp = pd.merge(temp, product_info[['lp_id', 'product_name']], on = 'lp_id')
        temp['time'] = pd.to_datetime(temp['time'])
        temp['week'] = temp['time'].dt.isocalendar().week
        # simple line chart with Plotly Express
        import plotly.express as px
        df_used_plot = temp.groupby(['week','state'])[str(interest)].agg(method_wanted).to_frame().reset_index()
        # sort the dataframe
        sorted_df = df_used_plot.copy() 
        # map the value order
        sorted_df["order"] = sorted_df["state"].map({list_2_state_highlighted[0]:1, list_2_state_highlighted[1]:2}).fillna(3)
        # sort by this order
        sorted_df.sort_values(by=["order","week"], ascending=False, inplace=True)

        fig = px.line(sorted_df, 
                x="week", 
                y=str(interest), 
                color="state", 
                title="Distribution Of " + str(interest) + ' By State Over Week')
        # set color of all traces to lightgrey
        fig.update_traces({"line":{"color":"lightgrey"}})
        # color state line to blue
        fig.update_traces(patch={"line":{"color":"blue", "width":5}}, 
                        selector={"legendgroup":list_2_state_highlighted[0]})
        # color other state line to red
        fig.update_traces(patch={"line":{"color":"red", "width":5}}, 
                        selector={"legendgroup":list_2_state_highlighted[1]})
        # remove the legend, y-axis and add a title
        fig.update_layout(title="Distribution Of " + str(interest) + ' Of ' + str(len(list_id_product)) + ' Products By State Over Week',
                        showlegend=False,
                        yaxis={"visible":False})
        
        # plot the chart
        return fig.show()

# same plot as above but without a product filter: uses all products, including those not recorded in product_info
def plot_interest_by_state_over_week_any_products(interest, method_wanted, list_2_state_highlighted): # method is np.mean or np.median
        temp = result.copy()
        temp['time'] = pd.to_datetime(temp['time'])
        temp['week'] = temp['time'].dt.isocalendar().week
        # simple line chart with Plotly Express
        import plotly.express as px
        df_used_plot = temp.groupby(['week','state'])[str(interest)].agg(method_wanted).to_frame().reset_index()
        # sort the dataframe
        sorted_df = df_used_plot.copy() 
        # map the value order
        sorted_df["order"] = sorted_df["state"].map({list_2_state_highlighted[0]:1, list_2_state_highlighted[1]:2}).fillna(3)
        # sort by this order
        sorted_df.sort_values(by=["order","week"], ascending=False, inplace=True)

        fig = px.line(sorted_df, 
                x="week", 
                y=str(interest), 
                color="state", 
                title="Distribution Of " + str(interest) + ' By State Over Week')
        # set color of all traces to lightgrey
        fig.update_traces({"line":{"color":"lightgrey"}})
        # color state line to blue
        fig.update_traces(patch={"line":{"color":"blue", "width":5}}, 
                        selector={"legendgroup":list_2_state_highlighted[0]})
        # color other state line to red
        fig.update_traces(patch={"line":{"color":"red", "width":5}}, 
                        selector={"legendgroup":list_2_state_highlighted[1]})
        # remove the legend, y-axis and add a title
        fig.update_layout(title="Distribution Of " + str(interest) + ' Of ' + ' Any Products Including Products Not Recorded In Product Info DataFrame By State Over Week',
                        showlegend=False,
                        yaxis={"visible":False})
        
        # plot the chart
        return fig.show()


We highlight New York and Texas because these two states had the highest and the lowest page load per student, respectively. The second and third charts are similar but highlight the states with the highest and lowest percentage of access rather than page load per student.\
One thing to notice: in the second and third charts, the blue line representing North Dakota breaks at week 10, around early March. I am not sure why, but I think missing or unrecorded data in that state caused it. Otherwise, the general shape is similar to the chart plotted above, decreasing during the summer break and increasing in the spring and fall sessions, even though the percentage of access and page load varied considerably among states.

In [None]:
plot_interest_by_state_over_week('page_load_per_student', top_10products_id, np.mean, ['New York', 'Texas']) # highlight one the most pageloadperstudent one the least 

In [None]:
plot_interest_by_state_over_week('pct_access', top_10products_id, np.mean, ['North Dakota', 'North Carolina'] ) # highlight one the most access and another the least access
plot_interest_by_state_over_week_any_products('pct_access', np.mean, ['North Dakota', 'North Carolina'] ) # any products icluding not recorded in product_info dataframe

Before moving on to the next perspective, we quickly summarize the other variables in the dataset.
- Percentage of black/Hispanic students: most school districts in the dataset fall in the 0 to 20% range. Far fewer school districts have 80 to 100% black or Hispanic students. Black/Hispanic students tend to study in suburban schools more than in other locales. In cities, the ranges of the black/Hispanic ratio are about equally likely.
- Percentage of free/reduced lunch: suburbs and cities still receive more support than rural areas, because cities and suburbs have better conditions and resources to serve more students. Rural districts have fewer than 60% of students on free/reduced lunch, while some school districts in cities, suburbs, and towns have 80 to 100%.
- One of those advantages shows up in the pp_total_raw chart. Suburbs receive the most per-pupil expenditure. Rural areas and towns also receive funding, but less than suburbs or cities.
- Another consideration is whether the differences in free/reduced lunch and expenditure are related to race. In the last two charts below, a given range of the black/Hispanic ratio seems to correlate with a range of the free/reduced ratio. That is, school districts with a high percentage of black/Hispanic students also tend to have a high percentage of free/reduced lunch.
- Expenditure: school districts with fewer than 20% black/Hispanic students have both the highest and the lowest expenditure. So there is no obvious trend in how the government spends money on school districts, which may be affected by other conditions. 
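
One way to check the apparent link between the black/Hispanic range and the free/reduced range is a row-normalized contingency table. The sketch below builds one in plain Python on made-up districts. The data and the helper name `row_shares` are hypothetical. On the real data, `pd.crosstab(districts_info['pct_black/hispanic'], districts_info['pct_free/reduced'], normalize='index')` should give the equivalent table.

```python
from collections import Counter, defaultdict

# hypothetical districts: (pct_black/hispanic range, pct_free/reduced range)
districts = [
    ('[0, 0.2[', '[0, 0.2['),
    ('[0, 0.2[', '[0.2, 0.4['),
    ('[0, 0.2[', '[0, 0.2['),
    ('[0, 0.2[', '[0.2, 0.4['),
    ('[0.8, 1[', '[0.8, 1['),
    ('[0.8, 1[', '[0.6, 0.8['),
    ('[0.8, 1[', '[0.8, 1['),
    ('[0.8, 1[', '[0.8, 1['),
]

def row_shares(pairs):
    """For each row category, return the share of districts in each column category."""
    counts = defaultdict(Counter)
    for row_key, col_key in pairs:
        counts[row_key][col_key] += 1
    return {
        row_key: {col_key: n / sum(cols.values()) for col_key, n in cols.items()}
        for row_key, cols in counts.items()
    }

shares = row_shares(districts)
for row_key, cols in shares.items():
    print(row_key, cols)
```

If the shares concentrate along the diagonal, as in this toy data, the two ranges move together. A flat table would suggest no clear relationship.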

In [None]:
plt.figure(figsize=(10,8))
ax = sns.countplot(data=districts_info.sort_values(['pct_black/hispanic']).reset_index(drop=True), x='pct_black/hispanic', color='grey') # sort value 
ax.set_title('Distribution Of Percentages Of Black/Hispanic By School District')
ax.set_xlabel('Range Of Percentages Of Black/Hispanic ')
ax.set_ylabel('Frequencies')

# set orange color for the highest bar   
patch_h = []    
for patch in ax.patches:
    reading = patch.get_height()
    patch_h.append(reading)
# patch_h contains the heights of all the patches now
idx_tallest = np.argmax(patch_h)   
# np.argmax return the index of largest value of the list
ax.patches[idx_tallest].set_facecolor('orange')  

In [None]:
# plot multiple plot about pct_black/hispanic, pct_free/reduced, county_connections_ratio and pp_total_raw
X = [ (2,3,1),(2,3,2), (2,3,3),(2,1,2)]
columns = list(districts_info)[3:]
plt.figure(figsize=(16,11))
i = 0 
for nrows, ncols, plot_number in X:
    name = columns[i]
    plt.subplot(nrows, ncols, plot_number)
    ax = sns.countplot(data=districts_info.sort_values([str(name)]).reset_index(drop=True), hue='locale', x=str(name), palette='Set3')
    ax.legend(loc='upper right')
    ax.set_ylabel('Frequencies')
    i = i+1

In [None]:
plot_distribution_based_black_hispanic('pct_free/reduced')
plot_distribution_based_black_hispanic('pp_total_raw')

The table below shows additional information about black/Hispanic related to engagement index, percentage of access, and the frequency of particular percentage of black/Hispanic in schools.

In [None]:
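# correlate only the numeric columns; time and product_name are not numeric
product_access[['pct_access', 'engagement_index']].corr()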

In [None]:
mean_pct_eng = result.groupby('pct_black/hispanic')[['engagement_index','pct_access']].agg(['mean']).reset_index()
mean_pct_eng.columns = [col[0] for col in mean_pct_eng.columns] # rebuild columns
d = {}
for ratio in ['[0, 0.2[','[0.2, 0.4[','[0.4, 0.6[','[0.6, 0.8[','[0.8, 1[']:
    value = result[result['pct_black/hispanic']== ratio]['district_id'].nunique()
    d[ratio] = value
d = pd.DataFrame.from_dict(d, orient = 'index', columns = ['freq_appear']).reset_index()
race_related_inf = pd.concat([mean_pct_eng, d], axis=1).drop(columns=['index'])

In [None]:
race_related_inf


## 2. Digital Divide In Distance Learning

**What is the digital divide?**

The digital divide is a term that refers to the gap between demographics and regions that have access to modern information and communications technology (ICT) and those that do not or have restricted access. This technology can include the telephone, television, personal computers, and internet connectivity. The homework gap, an older term similar to the digital divide, refers to students without high-speed internet and an e-learning device who cannot complete assignments that require digital access.

**Why do we consider the digital divide an urgent matter?**

According to studies and reports, the digital divide is still very much a reality today. According to a 2019 report, approximately 5 million rural American households and 15.3 million households in urban or metro areas still do not have broadband internet access. After the pandemic hit the USA in March 2020, a June 2020 report published by Common Sense and BCG found that 15 million to 16 million American K–12 students (about 30%) lacked adequate connectivity, an e-learning device, or both.\
Meanwhile, a study by the Pew Research Center noted that 24% of adults with household incomes below $30,000 a year do not own a smartphone, and 40% of those with lower incomes do not have home broadband services or a computer. This means that children in these households cannot access digital learning either.\
The shift to digital learning threatened wholesale learning loss. Even as students return to the classroom and vaccines begin to be distributed, bridging the digital divide remains essential to reduce inequities, accelerate economic growth, and advance society as a whole. It also pays off in the long run, because no one can be sure what will happen next, whether a similar pandemic, an economic crisis, or something else. The digital divide negatively affects all aspects of life, not only education. A candid look at its impacts will guide the direction and speed of bridging the gap. Some of the clearest effects of the digital divide are listed below:
* <span style="color:orange">*Impact of the digital divide on the education.*</span>\
Indeed, education is a very dynamic sector, and keeping up to date is crucial to success. The presence of internet access will ensure you get the latest trends and revolutionize your research skills. Today, most students do homework and hand it out to their teacher through the internet, self-learning through online courses, reading ebooks, and many other activities. And not only students but teachers also need the internet for their lectures, instructions, and management.
* <span style="color:orange">*Impact of digital divide on the economy*</span>\
Interestingly, socio-economic status is one of the major causes of the digital divide, and it is also a consequence of the digital divide. Penetration of the internet enables people to engage in economically productive activities such as trade without much hassle. Inversely, without the internet become a main invisible barrier to trade efficiently, returns get lowered, and so many things.
* <span style="color:orange">*Impact of the digital divide on society*</span>\
Internet access gives a person a broader range of opportunities, so unequal access creates a social divide between rich and poor. That divide is a root of inequity and discrimination in society, especially in a multicultural country such as the USA.

<div style="text-align:center"><img src="https://www.pewresearch.org/wp-content/uploads/2021/06/ft_2021.06.22_digitaldivideincome_01.png" width="300" height="300" /></div>

**What do we get from closing the digital divide?**

<span style="color:orange">Permanently closing the digital divide is a fundamental matter of equity.</span>\
According to the [Quello Center broadband gap report (MSU)](https://quello.msu.edu/wp-content/uploads/2020/03/Broadband_Gap_Quello_Report_MSU.pdf), the absence of fast internet access at home has a significant negative relationship to overall GPA and to grades in English/language arts and social studies, but not in math and science. On average, students with fast home internet access report an overall grade point average (GPA) of 3.18. This is significantly higher than the average 2.81 GPA for students with no access and the 2.75 average for students who have only cell phone internet access. There are also gaps in digital skills and educational activities. For disconnected students, the digital divide could lead to significant learning loss and a higher dropout rate.
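
To make the size of these differences explicit, here is a minimal sketch that computes each group's GPA gap relative to students with fast home internet. It uses only the three averages quoted above; the dictionary keys and variable names are illustrative labels, not names from the report.

```python
# Average GPAs quoted above from the Quello Center report.
gpa_by_access = {
    "fast home internet": 3.18,
    "no home internet": 2.81,
    "cell phone only": 2.75,
}

baseline = gpa_by_access["fast home internet"]

# Gap of each group relative to students with fast home access,
# rounded to two decimals to avoid floating-point noise.
gpa_gap = {
    group: round(baseline - gpa, 2)
    for group, gpa in gpa_by_access.items()
    if group != "fast home internet"
}

for group, gap in gpa_gap.items():
    print(f"{group}: {gap:.2f} GPA points below fast home internet")
```

These are differences between group averages only; on their own they say nothing about causation.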

<span style="color:orange">Closing the digital divide makes learning future-proof and resilient.</span>\
In the near term, closing the digital divide builds resilience in our learning systems, even if vaccines do not reach the whole population and pandemics continue. It minimizes learning loss and inequities in scores and in access to materials, lectures, etc. In the long term, it decreases the dropout rate and helps students unlock individualized learning pathways through gamification, adaptive learning, and asynchronous engagement. Education will be better, and equity will reach everyone, not just a few.

<span style="color:orange">Closing the digital divide contributes to breaking the cycle of poverty.</span>\
When the digital gap narrows, families have more chances to access the internet and gain the skills needed for their current jobs, which means they could earn more money than before, and their children also have good conditions for learning and growth. This in turn leads to better health care and services and a happier life.

<span style="color:orange">Closing the digital divide also helps reduce the money spent in the long term.</span>\
In Maryland, \$100 million was allocated to local school systems to address the digital divide and ensure students have access to up-to-date devices and connectivity.\
Alabama's governor allocated \$100 million in CARES Act funding for a public-private partnership to increase access to the internet for K-12 students attending school in the fall who may need internet service for distance learning.\
In New Hampshire, \$50 million was allocated to the Connecting New Hampshire Emergency Broadband Expansion Program.\
These are examples of the money spent on internet access, broadband, etc., every year. A comprehensive, nationwide investment in closing the digital divide is therefore urgent.
 

Policymakers, government organizations, and technology companies must therefore change how they view the digital divide, because numbers do not lie. *Closing the digital divide is a long-term undertaking, and it is both necessary and urgent today!*

**What are the root causes of the digital divide?**


Three key reasons explain the divide: lack of affordability, lack of availability, and lack of adoption. In plain English, the reasons come from infrastructure affordability, access challenges, and other barriers to adoption.
<div style="text-align:center"><img src="https://pbs.twimg.com/media/E9zLXYWUYAc0huo?format=jpg&name=medium" width="380" height="380" /></div>

<span style=" color: orange">Lack of affordability:</span> Students and families who cannot afford reliable internet service or e-learning devices make up a significant share of households without internet or devices. As the chart below shows, a quarter of home broadband users with annual household incomes ranging from \\$30,000 to just under \\$50,000 say they have had trouble paying for their service during the pandemic, as have roughly one-in-ten (8%) of those with household incomes ranging from \\$50,000 to \\$74,999, according to Pew Research Center. There are also differences by Americans' educational attainment. To understand the barriers to universal internet access, the National Telecommunications and Information Administration (NTIA) asked families for their main reason for not having internet access at home. In 2019, the two most commonly cited main reasons were a lack of need or interest (60%) and cost (18.7% responded that it was too expensive). Other main reasons included having no or an inadequate computer (3%), internet service not being available in the area (3.2%), using the internet somewhere else (3%), and privacy or security concerns (2%). According to Organisation for Economic Co-operation and Development (OECD) data, the average cost of a broadband internet connection in the US is \\$61. The US ranks second on the list of OECD countries, and its cost is more than double that of other wealthy countries such as Germany, Japan, Finland, and Denmark. For low-income households, this is a real additional burden.
<div style="text-align:center"><img src="https://www.pewresearch.org/wp-content/uploads/2021/06/ft_2021.06.03_broadband_01a.png" width="380" height="380" /></div>
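
To put the broadband price in perspective, the following rough sketch estimates the share of annual household income that goes to broadband. It assumes the \\$61 OECD figure quoted above is a monthly price (the text does not state the billing period), and the function name `broadband_income_share` is a hypothetical helper. The income levels are the bracket boundaries mentioned in the Pew figures.

```python
# USD; ASSUMPTION: the OECD figure quoted in the text is a monthly price.
MONTHLY_BROADBAND_COST = 61


def broadband_income_share(annual_income, monthly_cost=MONTHLY_BROADBAND_COST):
    """Return the percentage of annual income spent on broadband."""
    if annual_income <= 0:
        raise ValueError("annual_income must be positive")
    return 12 * monthly_cost / annual_income * 100


# Income bracket boundaries taken from the Pew figures quoted above.
for income in (30_000, 50_000, 75_000):
    share = broadband_income_share(income)
    print(f"${income:,} per year: {share:.2f}% of income goes to broadband")
```

Under that assumption, the same bill takes a share of a \\$30,000 income that is two and a half times the share it takes of a \\$75,000 income.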



In [None]:
import pandas as pd

# dataset downloaded from https://www.ntia.gov/files/ntia/publications/ntia-analyze-table_2020-05-15.csv
nita_gov_data = pd.read_csv(r'/kaggle/input/ntiagov-dataset/ntia_analyze_2020.csv')

In [None]:
# NTIA variables that record a household's main reason for not being online at home
# (named reason_vars to avoid shadowing the built-in `list`)
reason_vars = ['canUseElsewhereMainReason','noNeedInterestMainReason','noComputerMainReason','unavailableMainReason','privSecMainReason','tooExpensiveMainReason']
# keep only the rows for those variables; .copy() avoids a SettingWithCopyWarning on the assignments below
not_internet_reason = nita_gov_data[nita_gov_data['variable'].isin(reason_vars)][['dataset','usProp','description']].copy()
not_internet_reason['usProp'] = not_internet_reason['usProp']*100  # convert proportion to percentage
not_internet_reason['description'] = not_internet_reason['description'].str.replace("Main Reason for Household Not Online at Home: ", "", regex=False)  # shorten labels for readability

In [None]:
import plotly.express as px
fig = px.line(not_internet_reason, x='dataset', y='usProp', color='description')
fig.update_layout(
    title="Main Reason For Not Having Access",
    xaxis_title="Time",
    yaxis_title="Percent",
    legend_title="Main Reason",
    font=dict(
        family="Courier New, monospace",
        size=15,
        color="Black"
    )
)
fig.show()
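
As a quick cross-check of the 2019 figures quoted in the text, here is a small sketch that ranks the six main reasons and computes the share left over. The percentages are copied from the paragraph above rather than read from the NTIA file, the short labels are illustrative, and the "other or unlisted" remainder assumes all six percentages share the same base.

```python
# Main reasons for not being online at home in 2019, in percent,
# copied from the NTIA figures quoted in the text above.
main_reasons_2019 = {
    "no need or interest": 60.0,
    "too expensive": 18.7,
    "no or inadequate computer": 3.0,
    "not available in area": 3.2,
    "can use elsewhere": 3.0,
    "privacy or security concerns": 2.0,
}

# Rank reasons from most to least cited.
ranked = sorted(main_reasons_2019.items(), key=lambda item: item[1], reverse=True)

# ASSUMPTION: all percentages share the same base, so the remainder is
# the share of households whose main reason is not one of the six above.
unlisted_share = round(100 - sum(main_reasons_2019.values()), 1)

for reason, share in ranked:
    print(f"{reason:<30}{share:>6.1f}%")
print(f"{'other or unlisted':<30}{unlisted_share:>6.1f}%")
```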

<span style="color:orange">Lack of availability:</span> In some places where students live, there is insufficient coverage to deliver wired or wireless broadband service, or service quality (e.g., speed and reliability) is poor. Especially in rural areas, many students and households lack internet connections because their communities lack local broadband infrastructure. This physical infrastructure problem leaves specific communities unable to access broadband at industry-standard speeds regardless of their desire or financial resources. According to tech giant [Microsoft](https://www.nytimes.com/2018/12/04/technology/digital-divide-us-fcc-microsoft.html), more than 157 million people were not using the internet at broadband speeds as of 2019, even though nearly 97% of the population said they use the internet. In [US Investment Gaps By 2040](https://www.cfr.org/backgrounder/state-us-infrastructure), experts warn of the “broadband gap,” in which rural and low-income communities suffer from a lack of infrastructure to deliver reliable, fast internet, referred to as broadband. A 2020 Federal Communications Commission report finds that some 18 million Americans, most of whom live in rural areas, lack access to any broadband network. Other estimates suggest that more than twice as many people lack access. Governors from both major parties identify internet access as a priority in their states and propose plans costing tens of millions of dollars. This issue disproportionately affects remote learning for students in rural communities where access to fixed broadband is limited. Moreover, according to the [Federal Communications Commission’s 2019 Broadband Deployment Report](https://www.fcc.gov/reports-research/reports/broadband-progress-reports/2019-broadband-deployment-report), 21.3 million Americans lack access to a connection that enables a download rate of at least 25 megabits per second (Mbps) and an upload rate of 3 Mbps, speeds that are considered to be the industry standard. This includes students at all levels, both K-12 and higher education.

<div style="text-align:center"><img src="https://broadbandnow.com/app/uploads/2020/03/access-to-terrestrial-broadband2.png" width="450" height="450" /></div>
<div style="text-align:center"><img src="https://broadbandnow.com/app/uploads/2020/03/fastest-speed.png" width="450" height="450" /></div>


<span style="color:orange">Lack of adoption:</span> Many students and households are unable to access the internet because they do not have an adequate device to connect with. According to 2018 [American Community Survey](https://futureready.org/wp-content/uploads/2020/08/HomeworkGap_FINAL8.06.2020.pdf) data, approximately 3.6 million households, including 7.3 million children in the United States, do not have a laptop, desktop, or tablet with which to connect to the internet. The lack of a laptop or desktop computer to complete schoolwork is a barrier to full participation in remote learning. Among children ages 3-18, 17% live in households without a laptop or desktop computer. At least 11 million students do not have a computer for online learning at all, in addition to those who may need to share a single device with siblings. These figures do not account for the quality of the devices used for distance learning. In other cases, despite living in a place where access is available and affordable, students may be disconnected due to a wide range of adoption barriers such as insufficient digital skills, language barriers, discomfort with providing personal data, family mobility, or lack of interest.

**Digitally Divided Student Segments**

Differences in the opportunity to access the internet or devices divide students into three main segments: fully connected (devices, connectivity, or both always available), fully disconnected (devices, connectivity, or both never available), and deficient (devices, connectivity, or both only partly available).

<span style="color:orange">Fully connected or available devices or both:</span> Defined as students with either distance learning devices or adequate connectivity, or both. According to the table from the [U.S. Census Bureau Household Pulse Survey](https://www.census.gov/data/tables/2020/demo/hhp/hhp5.html) below, of the 6.3 million students in the less-than-high-school group, nearly 53% have available devices or adequate connectivity for distance learning; the figure is 65% of the 17 million students in the high school or GED group.

<span style="color:orange">Fully disconnected or unavailable devices or both:</span> Defined as students who lack distance learning devices, adequate connectivity, or both. This least-connected segment is the smallest but also one of the most important to address; it includes students who have no high-speed internet and no device in their household. These students risk losing their learning permanently, which can even raise the dropout rate. The segment covers nearly 4.2% of the 6.3 million students in the less-than-high-school group (200-300 thousand students) and 2% of the 17 million students in the high school or GED group (300-400 thousand students).

<span style="color:orange">Device deficient or connection deficient or both:</span> Students whose devices or connectivity are only usually, sometimes, or rarely available are categorized as deficient in devices, in connectivity, or in both. Lacking appropriate devices such as laptops or computers for distance learning, or being unable to use borrowed ones, makes learning inefficient and disrupts students' progress. Unreliable internet, such as low speeds or no service at home, has a similar effect. In the less-than-high-school group, 2-3 million students reported being device deficient, compared with 5-6 million students in the high school or GED group.
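
The percentages above can be turned into approximate head counts with a few lines of arithmetic. This is a minimal sketch using only the figures quoted in the three segment descriptions; `segment_counts` is a hypothetical helper, and treating the deficient segment as the remainder after the other two is an assumption made here for illustration.

```python
# Totals and percentages quoted in the segment descriptions above.
groups = {
    "less than high school": {"students": 6_300_000, "connected_pct": 53.0, "disconnected_pct": 4.2},
    "high school or GED": {"students": 17_000_000, "connected_pct": 65.0, "disconnected_pct": 2.0},
}


def segment_counts(students, connected_pct, disconnected_pct):
    """Convert segment percentages into approximate student counts."""
    # ASSUMPTION: the deficient segment is whatever remains after the
    # fully connected and fully disconnected shares are removed.
    deficient_pct = 100.0 - connected_pct - disconnected_pct
    return {
        "connected": round(students * connected_pct / 100),
        "disconnected": round(students * disconnected_pct / 100),
        "deficient": round(students * deficient_pct / 100),
    }


counts = {name: segment_counts(**figures) for name, figures in groups.items()}

for name, result in counts.items():
    print(name, result)
```

The remainders land inside the 2-3 million and 5-6 million ranges given for the deficient segment, so the quoted percentages and ranges are consistent with each other.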

In [None]:
segment = pd.read_csv(r'/kaggle/input/student-segment/student_segment.csv').dropna() # dataset source: U.S. Census Bureau Household Pulse Survey, Week 5.
list_segment_columns = segment.columns.values.tolist()[1:]
for col in list_segment_columns:
    segment[col] = segment[col].apply(lambda x: x.replace(",","")).astype('int64')  # reformat and convert to int

# usually, sometimes, and rarely -> deficient; always -> fully available; never -> unavailable
segment["avail_device"] = segment['Device always available for educational purposes'] 
segment['full_connect'] = segment['Internet always available for educational purposes']  
segment["unavail_device"] = segment['Device never available for educational purposes'] 
segment['full_disconnect'] = segment['Internet never available for educational purposes'] 
segment["device_deficient"] = segment['Device sometimes available for educational purposes'] + segment['Device usually available for educational purposes'] + segment['Device rarely available for educational purposes']
segment['connect_deficient'] = segment['Internet sometimes available for educational purposes'] + segment['Internet usually available for educational purposes'] + segment['Internet rarely available for educational purposes']
segment.drop(columns=segment.columns.values.tolist()[2:12],inplace =True)
# display
segment

**Other Problems That Should Be Taken Into Consideration**

<span style="color:orange">Digital gap among teachers:</span>  A new survey of teachers during the pandemic carried out by VeraQuest Research, LLC shows significant disparities in how confident teachers felt about using digital media services. Only 66% of teachers reported being very or extremely confident in using digital media services for teaching after the pandemic-prompted shift to remote learning. This lack of confidence may in part be associated with the finding that nearly one in seven teachers (13%) had not previously used these services, reporting they started using K-12 digital media services only after the COVID-19-related school closures. In general, the more digital services a teacher was using, the higher their confidence level in using K-12 digital media in their teaching.

<span style="color:orange">Trends impacting distance learning in 2020:</span> In addition to the figures collected and shown above, we explore key underlying trends that occurred across the U.S. in 2020 to gain other perspectives on the digital divide. Based on analysis reports and mass media, six key trends are considered.
- First, unprecedented unemployment rates are forcing many families that were previously in the middle class (i.e., not qualified for free and reduced lunch) to require services and support to meet basic needs, including food security. Based on a [U.S. Census Bureau survey conducted from July 2 to July 7, 2020](https://www.census.gov/data-tools/demo/hhp/#/?measures=HIR), nearly 43.4 million Americans, or 25.3% of the adult population, either missed last month’s rent or mortgage payment or have little to no confidence that they can pay next month’s rent or mortgage on time. With the Keep Americans Connected Pledge expiring on June 30, many families would need to make difficult financial trade-offs, including becoming delinquent on or opting out of household internet service due to these economic challenges.
- Second, social distancing and home quarantine under COVID-19 make internet connectivity essential to safely stay in touch with friends and family, work from home, apply for jobs, and keep up with critical developments. Before the pandemic, schools were the bridge between students with and without internet at home. Students who had previously relied on public libraries, schools, and public Wi-Fi in cafés and restaurants, which are now closed or limiting patrons, find that having access to the internet at home has become increasingly critical.
- Third, with 37 million total cases and 680 thousand total deaths, the U.S. ranked first worldwide on the Covid map. Many families and students feared being diagnosed with Covid-19 or losing family members, and some faced severe depression.
- Fourth, according to new data collected by [Legal Templates, a company that provides legal documents](https://legaltemplates.net/resources/personal-family/divorce-rates-covid-19/#divorces-increase-in-couples-with-children), the number of people looking for divorces was 34% higher from March through June compared to 2019, and 31% of the couples admitted lockdown had caused irreparable damage to their relationships. The combination of stress, unemployment, financial strain, death of loved ones, illness, homeschooling children, mental illnesses, and more has significantly strained relationships.
- Fifth, there have been significant, swift efforts by districts, governments, the private sector, and philanthropy across the United States to provide devices and connectivity to students since March 2020. These efforts have certainly reduced the existing digital gaps among students.
- Sixth, Covid-19 prompted the government to send Americans unprecedented financial help. President Joe Biden signed a \\$1.9 trillion relief bill, which plans to send direct payments of up to \\$1,400 to most Americans, extends a \\$300 per week unemployment insurance supplement, expands the child tax credit, and puts funds into vaccine distribution.

These supply and demand trends will undoubtedly have different and opposing impacts on the size of the digital divide in 2020, and it is too early to understand how they will change the size and nature of the divide because there is not yet enough data to quantify them. However, they are critical and helpful for observing and understanding the drivers and size of this gap in light of what happened around us in real life.

In [None]:
%%HTML
<iframe width=560 height=349 src=https://player.cnbc.com/p/gZWlPC/cnbc_global?playertype=synd&byGuid=7000181309 frameborder=0 scrolling=no allowfullscreen webkitallowfullscreen mozallowfullscreen oallowfullscreen msallowfullscreen ></iframe>

In [None]:
unemploy_rate = pd.read_csv(r'/kaggle/input/unemployment-rate/unemployment_rate.csv', header = 1) # source: U.S. Bureau of Labor Statistics, Current Population Survey.
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(go.Scatter(x=unemploy_rate['Quarter'], y=unemploy_rate['Labor force participation rate'],
                    mode='lines',
                    name='Labor force participation rate'))
fig.add_trace(go.Scatter(x=unemploy_rate['Quarter'], y=unemploy_rate['Employment–population ratio'],
                    mode='lines',
                    name='Employment–population ratio'))
# Edit the layout
fig.update_layout(title='Labor force participation rate and employment–population ratio, quarterly averages, seasonally adjusted, 2000–20',
                   xaxis_title='Quarter',
                   yaxis_title='Percent')
fig.show()

## 3. Differences in the digital divide across certain demographics

<span style="color:orange">Household income</span>\
According to the U.S. Census Bureau Household Pulse Survey, Week 5, the digital divide affects students of all income levels, but students from lower-income homes are most likely to be disconnected, to lack proper devices, or both. Approximately 450 thousand students from families with annual household incomes of less than \$25,000 are without internet, proper devices, or both, which is three times as many as in middle-income families and also much higher than in high-income families.


In [None]:
by_income = pd.read_csv(r'/kaggle/input/by-income/by_income.csv') # read file csv
# clean, reformat, and pre-process
by_income['Household income ']=by_income['Household income '].apply(lambda x : x.strip(" "))
by_income['avail_device/internet/both'] = (by_income['Device always available for educational purposes']+by_income['Internet always available for educational purposes'])/2
by_income['ineffiecient_device/internet/both'] = (by_income['Device usually available for educational purposes']+by_income['Device sometimes available for educational purposes']+by_income['Device rarely available for educational purposes']
+by_income['Internet usually available for educational purposes']+by_income['Internet sometimes available for educational purposes']+by_income['Internet rarely available for educational purposes'])/6
by_income['unavail_device/internet/both'] = (by_income['Device never available for educational purposes']+by_income['Internet never available for educational purposes'])/2
by_income.drop(columns=by_income.columns.values.tolist()[2:14],inplace=True)
by_income['Household income '] = by_income['Household income '].apply(lambda x:x.replace("$",""))

#plot bar chart
import plotly.graph_objects as go

fig = go.Figure()

fig.add_trace(go.Bar(
    y=by_income['Household income '],
    x=by_income['avail_device/internet/both'],
    name='Available Device/Internet or Both',
    orientation='h',
    marker=dict(
        color='rgba(246, 78, 139, 0.6)',
        line=dict(color='rgba(246, 78, 139, 1.0)', width=1))))

fig.add_trace(go.Bar(
    y=by_income['Household income '],
    x=by_income['ineffiecient_device/internet/both'],
    name='Inefficient Device/Internet or Both',
    orientation='h',
    marker=dict(
        color='rgba(18, 36, 97, 0.6)',
        line=dict(color='rgba(18, 36, 97, 1.0)', width=1))))
fig.add_trace(go.Bar(
    y=by_income['Household income '],
    x=by_income['unavail_device/internet/both'],
    name='Unavailable Device/Internet or Both',
    orientation='h',
    marker=dict(
        color='rgba(58, 71, 80, 0.6)',
        line=dict(color='rgba(58, 71, 80, 1.0)', width=1))))

fig.update_layout(barmode='stack',yaxis_title="Dollar ($)", title="Digital Divide By Income Level In Week 5, 2020")
fig.show()

<span style="color:orange">Geography</span>\
The divide affects students in all 50 states. The [interactive map](https://www.commonsensemedia.org/digital-divide-stories#/state) shows the student digital divide across all 50 states. In Mississippi, the state with the largest digital divide, approximately 50% of students lack adequate internet. Even in states with small divides, such as New Jersey, Connecticut, and Massachusetts, as many as 1 in 4 students still lack adequate internet. In the most populous U.S. states (Texas, California, and Florida), many students are without reliable internet and proper devices. The shares are even higher in rural southern states, including Mississippi, Alabama, Arkansas, and Oklahoma.

<span style="color:orange">Density</span>\
The divide occurs in urban, suburban, and rural areas. According to a Pew Research Center survey conducted earlier this year, 24% of rural adults say access to high-speed internet is a major problem in their local community, compared with 13% of urban adults and 9% of suburban adults. Similar rates of concern about access to high-speed internet were shared by rural adults in both lower- and higher-income households, as well as by those with various levels of educational attainment. When looking at differences by community type in technology ownership, rural adults are less likely than urban adults to own traditional or tablet computers.

<div style="text-align:center"><img src="https://www.pewresearch.org/wp-content/uploads/2021/08/FT_21.06.04_RuralBroadband.png" width="450" height="450" /></div>
<div style="text-align:center"><img src="https://assets.pewresearch.org/wp-content/uploads/sites/1/2018/09/FT_18.09.07_RuralInternet_roughly-one-in-four.png" width="450" height="450" /></div>

<span style="color:orange">Race/Ethnicity</span>\
For both internet and computers, White and Asian children have higher-than-average rates of access, whereas Black and Hispanic children have near-average rates and American Indian and Native Alaskan children have lower-than-average rates. Access is lowest for American Indian and Native Alaskan children, of whom 65% have access to a computer and 63% have home internet. Another report shows that although the White alone (not Hispanic) group has higher internet or computer access (or both) than other races, its figure without internet or computer access is high too. The Asian alone (not Hispanic) group has the lowest rate without internet, a computer, or both.

<div style="text-align:center"><img src="https://staticweb.usafacts.org/media/images/Tech_access_by_race_uVEEMGh.width-1200.png" width="450" height="450" /></div>


In [None]:
by_race = pd.read_csv(r'/kaggle/input/by-race/by_race.csv') # read file csv
# clean, reformat, and pre-process
by_race['Hispanic origin and Race ']=by_race['Hispanic origin and Race '].apply(lambda x : x.strip(" "))
by_race['avail_device/internet/both'] = (by_race['Device always available for educational purposes']+by_race['Internet always available for educational purposes'])/2
by_race['ineffiecient_device/internet/both'] = (by_race['Device usually available for educational purposes']+by_race['Device sometimes available for educational purposes']+by_race['Device rarely available for educational purposes']
+by_race['Internet usually available for educational purposes']+by_race['Internet sometimes available for educational purposes']+by_race['Internet rarely available for educational purposes'])/6
by_race['unavail_device/internet/both'] = (by_race['Device never available for educational purposes']+by_race['Internet never available for educational purposes'])/2
by_race.drop(columns=by_race.columns.values.tolist()[2:14],inplace=True)


#plot bar chart
import plotly.graph_objects as go

fig = go.Figure()

fig.add_trace(go.Bar(
    y=by_race['Hispanic origin and Race '],
    x=by_race['avail_device/internet/both'],
    name='Available Device/Internet or Both',
    orientation='h',
    marker=dict(
        color='rgba(246, 78, 139, 0.6)',
        line=dict(color='rgba(246, 78, 139, 1.0)', width=1))))

fig.add_trace(go.Bar(
    y=by_race['Hispanic origin and Race '],
    x=by_race['ineffiecient_device/internet/both'],
    name='Inefficient Device/Internet or Both',
    orientation='h',
    marker=dict(
        color='rgba(18, 36, 97, 0.6)',
        line=dict(color='rgba(18, 36, 97, 1.0)', width=1))))
fig.add_trace(go.Bar(
    y=by_race['Hispanic origin and Race '],
    x=by_race['unavail_device/internet/both'],
    name='Unavailable Device/Internet or Both',
    orientation='h',
    marker=dict(
        color='rgba(58, 71, 80, 0.6)',
        line=dict(color='rgba(58, 71, 80, 1.0)', width=1))))

fig.update_layout(barmode='stack',yaxis_title="Race", title="Digital Divide By Race/Ethnicity In Week 5, 2020")
fig.show()
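
The income and race cells above repeat the same three-segment calculation. A minimal sketch of a shared helper is shown below, assuming the input table uses the same Household Pulse Survey column names as `by_income` and `by_race`. The function name `add_divide_segments`, the `Group` column, and the one-row table are hypothetical; the toy numbers are made up purely to demonstrate the arithmetic and are not survey data. The output column is spelled `inefficient_...`, which differs from the misspelled column name used in the cells above.

```python
import pandas as pd

# Availability levels that make up each segment, mirroring the cells above.
LEVELS = {
    "avail": ["always"],
    "inefficient": ["usually", "sometimes", "rarely"],
    "unavail": ["never"],
}


def add_divide_segments(df):
    """Return a copy of df with three averaged device/internet segment columns."""
    out = df.copy()
    for segment, levels in LEVELS.items():
        cols = [
            f"{kind} {level} available for educational purposes"
            for kind in ("Device", "Internet")
            for level in levels
        ]
        # Averaging the device and internet columns matches the
        # (sum of columns) / (number of columns) formula used above.
        out[f"{segment}_device/internet/both"] = out[cols].mean(axis=1)
    return out


# Made-up illustrative numbers (NOT survey data), one row only.
toy = pd.DataFrame({
    "Group": ["example group"],
    "Device always available for educational purposes": [80],
    "Device usually available for educational purposes": [10],
    "Device sometimes available for educational purposes": [5],
    "Device rarely available for educational purposes": [3],
    "Device never available for educational purposes": [2],
    "Internet always available for educational purposes": [70],
    "Internet usually available for educational purposes": [15],
    "Internet sometimes available for educational purposes": [8],
    "Internet rarely available for educational purposes": [4],
    "Internet never available for educational purposes": [3],
})

result = add_divide_segments(toy)
print(result[[c for c in result.columns if c.endswith("/both")]])
```

Keeping the calculation in one function means a correction to the segment definitions only has to be made once for both demographic breakdowns.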

## 4. Technology Requirements For Distance Learning (Reference)

Four things should be considered for a robust distance learning experience:
* High-speed internet (reliable)
* Proper devices for distance learning
* Instructional content for distance learning
* Technical support

<span style="color:orange">High-speed internet</span>\
Internet service availability and quality of service can be a challenge, so learners must be careful when selecting an Internet service provider. Many providers promote their services as “high speed” when they are no faster than a low-end DSL connection. Online learners should have an Internet speed of at least 1.5 Mbps (both upload and download) as a minimum requirement; see [this guide](https://www.sde.idaho.gov/re-opening/files/connectivity/Internet-Needs-to-support-remote-learning.pdf) for more details. To determine your Internet speed, you can use an Internet speed test such as [Speedtest](http://www.speedtest.net).
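
Once you have a speed test result, checking it against a threshold is simple. The sketch below compares a measured connection with the 1.5 Mbps minimum mentioned above and with the 25 Mbps download / 3 Mbps upload benchmark cited earlier from the FCC's 2019 Broadband Deployment Report. The class and function names are hypothetical, and the measured values are example inputs rather than real measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeedRequirement:
    name: str
    min_download_mbps: float
    min_upload_mbps: float


# Thresholds taken from the text: 1.5 Mbps up and down for online learning,
# and 25 Mbps down / 3 Mbps up as the FCC benchmark.
ONLINE_LEARNING_MINIMUM = SpeedRequirement("online learning minimum", 1.5, 1.5)
FCC_BROADBAND_BENCHMARK = SpeedRequirement("FCC broadband benchmark", 25.0, 3.0)


def meets(requirement, download_mbps, upload_mbps):
    """Return True if both measured speeds reach the requirement."""
    return (
        download_mbps >= requirement.min_download_mbps
        and upload_mbps >= requirement.min_upload_mbps
    )


# Example inputs (hypothetical speed test result, in Mbps).
measured_download, measured_upload = 10.0, 2.0

for requirement in (ONLINE_LEARNING_MINIMUM, FCC_BROADBAND_BENCHMARK):
    status = "meets" if meets(requirement, measured_download, measured_upload) else "falls short of"
    print(f"{measured_download}/{measured_upload} Mbps {status} the {requirement.name}")
```

A connection can clear the learning minimum and still fall short of the broadband benchmark, which is why the two thresholds are kept separate.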

Pros and cons of each type of internet:
- Fiber internet is the fastest on the market but has limited availability and is often the most expensive.
- Cable internet is fast and widely available, making it a great option for most customers, but it may be limited in rural areas.
- DSL internet operates through telephone networks, making it accessible to rural customers, but provides slower maximum speeds than cable.
- Satellite internet also connects to remote locations but suffers from spotty service and notoriously slow speeds.

|      Type      |  Fiber internet | Cable internet | DSL internet    | Satellite internet | Mobile Internet |
| -------------  |:---------------:| --------------:| ---------------:| ------------------:| ---------------:|
| Pros           | Fast, reliable  | Widely available, good speed & cost | Widely available, reasonable cost | Accessible | Convenience, flexible |
| Cons           | Expensive, not widely available |Lower accessibility in rural and remote areas| Medium speed, vulnerable | Slower speed, expensive | Data limited, expensive |

<span style="color:orange">Proper devices for distance learning</span>\
To join in distance learning, students and teachers need suitable devices, including laptops and tablets. Although mobile phones are helpful learning supplements, they are not appropriate for completing and submitting assignments or for studying in a virtual class for a long time; many education platforms are also not optimized for mobile devices.\
The appropriate device will depend on the connectivity solution available. With fixed broadband, suitable devices include traditional laptops and tablets with built-in Wi-Fi, with no additional hardware requirements. If a cellular network (4G or 5G) is the option, students and teachers will need LTE-enabled laptops or tablets, or a traditional laptop or tablet plus a mobile hotspot device. Additionally, learners are encouraged to use a PC, laptop, tablet, or notebook that is no more than a few years old to ensure compatibility and optimal performance (e.g., a PC should have an Intel i3 CPU or above, 6GB of RAM or above, and 500GB of storage).

<span style="color:orange">Instructional content for distance learning</span>\
Many platforms rely on real-time applications such as Zoom and Meet to engage directly with students as a substitute for the in-classroom experience. They are important tools for teachers to provide engagement with classmates, as well as 1-to-1 attention and support. Interesting and creative content is encouraged to keep students from getting bored. Teachers must also consider alternative instructional content, tools, and assignments that work at lower internet speeds or have modest hardware requirements.

<span style="color:orange">Technical support for distance learning</span>\
Quality technical support is required as users activate, build a knowledge base, and troubleshoot issues with students' or teachers' connectivity, devices, and tools. Digital literacy skills are a necessary pathway to bridging the homework gap. Students need support in developing the skills to take advantage of the opportunities enabled by internet connections and devices. Short-term training courses on technology skills for distance learning should be encouraged.




## 5. How To Close Digital Divide In The U.S.

To permanently close the digital divide, solutions must address each of the root causes presented above. 

<span style="color:orange">Avaiability</span>\
In rural areas, many students and households lack internet connections because their communities lack local broadband infrastructure. This physical infrastructure problem results in certain communities not having the ability, regardless of their desire or financial resources, to access broadband at industry-standard speeds. Permanently, closing the digital divide will only succeed if every household has a robust broadband connection. Policymakers should **promote program about widening internet network which can be accessed by everyone**. In a rural area where traditional internet, such as cable internet, cannot reach, the government may negotiate with these companies to provide satellite internet service like Starlink (Elon Musk). In old internet infrastructure, policymakers should ensure that government funding is used to upgrade and deploy broadband infrastructure that meets current established needs, at least for distance learning or even higher speed for other activities from home. **Researching new breakthrough technology in internet connection** need to be promoted and focused more. Federal and state policy should **launch policies for attracting telecommunications companies investment on internet services & infrastructure**. It does not only provide resident opportunity to access the internet, but help resident has more choice as well as expand the competition among companies which motivate them to improve their services more quality and lower cost. Recently, the plan which would devote **\$100 billion in \\$2.3 trillion infrastructure package to get all Americans connected has passed in The House**. It would spend \$100 billion to “future-proof” broadband as part of an eight-year infrastructure plan, calling high-speed connections “the new electricity” that’s now a necessity for all Americans. How is it effective? Americans could hope for promised future.
In addition to internet access, proper devices for learning are also necessary.

<span style="color:orange">Affordability</span>\
When local broadband infrastructure is available, the next problem is that students or households cannot access the internet at home because they cannot afford it. As presented in earlier sections, the cost of the internet in the U.S. is much higher than in other high-income countries, although the average internet speed is not especially impressive. Clearly, most Americans pay more for the internet than people elsewhere, and the cost adds up month after month. Meanwhile, the poverty rate in the U.S. surged from 9.3% in June to 11.7% in November 2020 because of the coronavirus pandemic’s decimation of the labor market and the months-long expiration of benefits, according to a [report](https://harris.uchicago.edu/files/monthly_poverty_rates_updated_thru_november_2020_final.pdf) released by analysts at the University of Chicago and the University of Notre Dame. This was the largest increase in a single year since the government began tracking poverty in 1960. So one solution to this problem is **decreasing the cost of the internet**. Policymakers and government should **hold meetings with telecommunications companies to discuss customer care and preferential pricing policies**. The priority is to **direct customer benefit policies toward the most vulnerable**, which may help them escape poor living conditions, including being unable to afford the internet.
Simultaneously, they should commit to **funding cost-support programs** that will cover student connectivity and device costs and ensure that all federal and state broadband programs allow for **transparency in pricing and encourage bulk-purchasing efforts** by states and districts. States and school systems will also need funding to support outreach for and **raise awareness of low-cost broadband service offerings and broadband service cost-support programs**.

<span style="color:orange">Adoption</span>\
Due to the impact of Covid-19, supply chain problems and delivery delays mean that the vast remote learning experiment still has not reached many American students, while learning gaps grow wider by the day. Only 24% of public school teachers reported that all of their students had access to a computer or tablet to use for schoolwork, according to a nationally representative survey of 600 public school teachers conducted in early May by Educators for Excellence (E4E). In the short term, to narrow the gap during Covid-19, governors must commit to **creating the most favorable conditions for maintaining the necessary supply chains**, such as the supply of computers and tablets, so that students have what they need to continue learning. In the long term, because manufacturing industries, especially the technology industry, face difficulties due to the global chip shortage, **government funds should be invested more heavily in chip manufacturing, which should be treated as a foundation that other manufacturing relies on**, including the manufacturing of computers, tablets, and laptops. Because of a wide range of other adoption barriers, these efforts should also **build digital literacy and inclusion skills into the curriculum early**, which may increase trust in technology solutions and help students understand the importance of digital skills. Designing solutions that **address distinct student needs** is also urgent, such as multilingual support materials, video training, support groups, or chat, which make students more confident and safer in using digital devices & the internet. Districts, community-based organizations, providers, and others **play an essential role in helping address parents' potential lack of trust in, and skepticism of,** technology solutions.

>In addition to what was suggested above, any policy must be tailored to each particular case. The U.S. has 50 states, and each state differs in demographics, infrastructure, etc. It is impossible to take one state's policies and simply apply them to others, right? The goal of this report is not to dictate in detail what policymakers or the government must do but to point out approaches and suggest specific solutions based on the underlying root causes. This report encourages and welcomes any policies or actions that are creative, suitable, and necessary for closing the digital divide.

# Conclusion

The digital divide may be the term we remember most from this report. It has become an increasingly important issue in modern education and an urgent problem in the US. Additionally, digital divides disproportionately affect specific student populations: Native American, Black, Asian, White and Hispanic students; members of low-income households; and students living in rural areas are less likely to have the internet access they need to participate fully in education. Lack of availability, affordability, and adoption are the three main reasons students go without internet or adequate devices, which divides students into segments and widens the digital and learning gap among them. With the impact of the Covid-19 pandemic, the situation has become worse than before. The necessary short-term actions are to help students avoid losing their learning progress and get through the pandemic. In the long term, the way forward is to adopt breakthrough and creative policies that close the digital gap and improve Americans' living conditions.