Author | Date 
:--------: | ------- 
Miquel, Elisabeth | 2021-09-30
<h1 style="font-family:Arial; text-align:center; font-size:36pt"><br>How has the Covid-19 pandemic impacted traditional schooling? </h1>


## Abstract: 
<div style="text-align: justify">This paper outlines the educational consequences of the policies adopted in response to the Covid-19 pandemic in the USA. We develop separate analyses based on race, reductions in food allowances, online connectivity, where students live, whether the state is Democratic or Republican, time periods (holidays and school vacations), and poverty rates in each state. 
We have obtained several useful findings for future study: 
- The sponsors have used a sample of 0.26% of US schools (2019-2020).
- Economic poverty, race, location and internet connectivity are the four key factors that determine whether a child can advance in school without the frustration created by the social divide. 
- Arizona and North Dakota are the states with the highest pct_access and engagement_index. 
- In the fall semester of the 2020-2021 academic year, requests for reduced-price lunches increased dramatically (68%) over the previous year. 
- Democratic states implemented emergency plans to limit the transmission of Covid-19 sooner than Republican states (Arizona is the exception, as it is a pro-online state; there are regions in Spain where this technique is widely used).</div>

## Introduction:
<div style="text-align: justify">The Covid-19 pandemic has negatively affected the lives of people worldwide. The economy, education, the welfare state, politics and the health care system have all been disrupted and strained by the pandemic. 
The first American case was reported on January 20, 2020, and President Donald Trump declared the U.S. outbreak a public health emergency on January 31. Restrictions were placed on flights arriving from China, but the initial U.S. response to the pandemic was otherwise slow in terms of preparing the healthcare system, stopping other travel, and testing. Meanwhile, Trump remained optimistic and was accused by his critics of underestimating the severity of the virus.
From that moment on, President Donald Trump instituted secular policies based on the idea that Democratic states would be the ones affected by the pandemic, because official reports announced that people of colour, Indigenous people and Hispanic people were the most affected, with large outbreaks in Seattle and New York (NY). This campaign was fatal to the implementation of containment measures to stop the spread of the virus in the population. However, thanks to the Tenth Amendment to the US Constitution, Democratic states began to put health and public protection policies in place before the pandemic was officially declared by the WHO on 2020-03-11. In Washington, for example, schools and universities began to close on 29 February, followed by California and NY on 4 and 7 March respectively. This affected the population of each state differently, creating a contrast between states and a unified chaos. The impact of Covid-19 has been analysed in healthcare, economics and politics, but what about the education of children? To stop the spread of Covid-19, one of the basic regulations was to close schools. Traditional (face-to-face) education was severely affected, and a distance and/or online system had to be put in place quickly to minimise the impact on education and allow students to continue learning. The system was not ready, and in many states a network for online education was not in place. Applications were created and, thanks to various platforms, it was possible to continue the education of children. But at this point new questions arise:
Do all households have broadband internet? Do all households have devices to connect to the internet? Does living in a city or in a rural area have an influence? All these questions follow from one irrefutable fact: the socio-economic gap will be further accentuated because traditional state-provided education will depend directly on the economy of each household. This will stifle students' chances of passing their courses and will strain family life, especially for the most disadvantaged. This reflection also covers the aid that parents request from the states to reduce the cost of school lunches. To answer all these questions we will use three datasets provided by the sponsors of the contest, plus additional datasets that I compiled from my own research to complete the information.</div>

## Data Preparation:
<div style="text-align: justify">To analyse the change in the learning situation of our students, we have three main datasets: products_info.csv, districts_info.csv and the engagement dataset. Let's look at them step by step. Let's get started!</div>

In [None]:
# import datasets:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
!pip install pdpipe

In [None]:
import datetime as dt
import pdpipe as pdp
from typing import Tuple, List, Dict
import glob

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
import plotly.offline

import re

In [None]:
us_state_abbrev = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'American Samoa': 'AS',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'District Of Columbia': 'DC',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Guam': 'GU',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Northern Mariana Islands':'MP',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Pennsylvania': 'PA',
    'Puerto Rico': 'PR',
    'Rhode Island': 'RI',
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virgin Islands': 'VI',
    'Virginia': 'VA',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY'
}


### A. Products:
<div style="text-align: justify">The product file products_info.csv includes information about the characteristics of the top 372 products with the most users in 2020. Here is the basic information:</div>

In [None]:
products = pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/products_info.csv')
print('General Information: ')
print()
print('Shape :',products.shape)
print('Information: \n')
products.info()

* __Variables description:__

Variable | Description | Missing value | Dtype
:--------: | :-------: | :--------: | -------
LP ID | The unique identifier of the product. | 0 | int64
URL | Web Link to the specific product. | 0 | object
Product Name | Name of the specific product. | 0 | object
Provider/Company Name | Name of the product provider. | 1 | object
Sector(s) | Sector of education where the product is used. | 20 | object
Primary Essential Function | The basic function of the product. There are two layers of labels here. Products are first labeled as one of these three categories: LC = Learning & Curriculum, CM = Classroom Management, and SDO = School & District Operations. Each of these categories has multiple sub-categories with which the products were labeled. | 20 | object
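
The "Missing value" and "Dtype" columns of a table like the one above can be produced directly with pandas. The following is a minimal sketch on a small, hypothetical stand-in frame (`toy` is not the real products_info.csv data); the same two calls would apply to `products`.

```python
import pandas as pd

# Toy stand-in for products_info.csv (hypothetical rows, not the real data).
toy = pd.DataFrame({
    "LP ID": [1, 2, 3],
    "Provider/Company Name": ["A", None, "B"],
    "Sector(s)": ["PreK-12", None, None],
})

# One row per column: number of missing values and the column dtype.
summary = pd.DataFrame({
    "Missing value": toy.isnull().sum(),
    "Dtype": toy.dtypes.astype(str),
})
print(summary)
```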

* __Treatment for missing values:__

To present the information as simply as possible, we have built the tables below, which show, for each company, the number of times it appears in the dataset, the proportions of its known values and the number of missing values. The aim is to lose as little information as possible without changing the overall statistics of each dataset. 

__1. Provider/Company Name variable:__

To deal with the missing value of this variable, we first look at which row contains the NaN in Provider/Company Name. It is the product `True North Logic`, which appears only once in the dataset, so we decide to remove it.

In [None]:
products[products['Provider/Company Name'].isnull()]

In [None]:
products[products['Product Name']=='True North Logic']

In [None]:
products.drop([371],axis=0, inplace=True)
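
The cell above drops the row by its hard-coded index label. As a hedged alternative, the same row can be selected with a boolean mask, which keeps working if the row order of the file changes. The sketch below uses a hypothetical toy frame (`toy`, with made-up product names), not the real data.

```python
import pandas as pd

# Toy stand-in with one missing provider (hypothetical rows).
toy = pd.DataFrame({
    "Product Name": ["Alpha", "Beta", "Gamma"],
    "Provider/Company Name": ["Acme", None, "Acme"],
}, index=[10, 20, 30])

# Boolean mask of the rows whose provider is missing.
missing_provider = toy["Provider/Company Name"].isnull()

# Keep every other row; no index label is hard-coded.
cleaned = toy[~missing_provider]
print(cleaned)
```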

__2. Sector(s) variable:__

The variable Sector(s) takes 5 values: 'PreK-12', 'PreK-12; Higher Ed', 'PreK-12; Higher Ed; Corporate', 'Corporate', 'Higher Ed; Corporate', plus NaN. Let's see which companies have missing values in their Sector. The Sector values of 19 products, with their respective companies, are missing. Products are unique but companies can be repeated, as a company can offer several products to its customers. An example is the company `Google LLC`, which appears 3 times among these rows.

In [None]:
products[products['Sector(s)'].isnull()]

We treat the missing values differently depending on whether the company is repeated or appears only once in the original dataset. Companies that appear only once are removed, because there is no way to infer their Sector. For repeated companies, we look at which Sector is predominant among the company's products and replace the missing values with that predominant value. For example, `Microsoft` appears 6 times, and two of its products have a missing Sector. In this way we limit the effect on the overall statistics. 
The companies that appear only once in the entire original dataset are:
`Yelp, Inc`, `Lea(R)n`, `Genius Media Group`, `U.S. News & World Report, L.P.`, `CBS Interactive`, `Safe YouTube`.

In [None]:
products = products.drop([146,158,174,311,331,354], axis=0)
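
The rows to drop can also be found programmatically instead of listing their index labels by hand. This is a minimal sketch of that idea on a hypothetical toy frame (`toy`, `rows_to_drop` and the company names are illustrative only): a row is treated as unrecoverable when its Sector is missing and its provider appears only once.

```python
import numpy as np
import pandas as pd

# Toy stand-in (hypothetical rows): "Solo" appears once and has no sector.
toy = pd.DataFrame({
    "Provider/Company Name": ["Acme", "Acme", "Solo", "Acme", "Other"],
    "Sector(s)": ["PreK-12", np.nan, np.nan, "PreK-12", "Corporate"],
})

# Number of products each provider has in the table, aligned to the rows.
provider = toy["Provider/Company Name"]
n_products = provider.map(provider.value_counts())

# Sector missing AND provider appears only once: nothing to impute from.
unrecoverable = toy["Sector(s)"].isnull() & (n_products == 1)
rows_to_drop = toy.index[unrecoverable].tolist()
toy = toy.drop(index=rows_to_drop)
print(rows_to_drop)
```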

We will now treat the missing values of the companies that are repeated in the original dataset, using the criterion described above:

* __2.1 IXL Learning:__
    
Company | # times in dataset | Proportions | # Missing value
:--------: | :--------: | :--------: | :--------:
IXL Learning | 4 | PreK-12 (100%)| 1 

In [None]:
products[(products['Provider/Company Name']=='IXL Learning')]['Sector(s)'].value_counts(normalize=True)
products[(products['Provider/Company Name']=='IXL Learning')]['Sector(s)'].notnull()
products.loc[61,'Sector(s)'] = 'PreK-12'

* __2.2 Microsoft:__
    
Company | # times in dataset | Proportions | # Missing value
:--------: | :--------: | :--------: | :--------:
Microsoft | 6 | PreK-12; Higher Ed; Corporate (75%) PreK-12; Higher Ed (25%)| 2   

In [None]:
# What is the dominant sector at Microsoft?
products[(products['Provider/Company Name']=='Microsoft')]['Sector(s)'].value_counts(normalize=True)
# What are the completed and missing values?
products[(products['Provider/Company Name']=='Microsoft')]['Sector(s)'].notnull()
# Substitution of missing values by the value of the predominant sector:
products.loc[183,'Sector(s)'] = 'PreK-12; Higher Ed; Corporate'
products.loc[293,'Sector(s)'] = 'PreK-12; Higher Ed; Corporate'

* __2.3 Houghton Mifflin Harcourt:__
    
Company | # times in dataset | Proportions | # Missing value
:--------: | :--------: | :--------: | :--------:
Houghton Mifflin Harcourt | 6 | PreK-12 (60%) PreK-12; Higher Ed (40%)| 1  

In [None]:
# What is the dominant sector?
products[(products['Provider/Company Name']=='Houghton Mifflin Harcourt')]['Sector(s)'].value_counts(normalize=True)
# What are the completed and missing values?
products[(products['Provider/Company Name']=='Houghton Mifflin Harcourt')]['Sector(s)'].notnull()
# Substitution of missing values by the value of the predominant sector:
products.loc[210,'Sector(s)'] = 'PreK-12'

* __2.4 ClassDojo, Inc.:__
    
Company | # times in dataset | Proportions | # Missing value
:--------: | :--------: | :--------: | :--------:
ClassDojo, Inc. | 2 | PreK-12 (100%) | 1  

In [None]:
# What is the dominant sector?
products[(products['Provider/Company Name']=='ClassDojo, Inc.')]['Sector(s)'].value_counts(normalize=True)
# What are the completed and missing values?
products[(products['Provider/Company Name']=='ClassDojo, Inc.')]['Sector(s)'].notnull()
# Substitution of missing values by the value of the predominant sector:
c = products[products['Provider/Company Name']=='ClassDojo, Inc.'].reset_index()
for i in c['index']:
    products.loc[i,'Sector(s)'] = 'PreK-12'

* __2.5 Google LLC:__
    
Company | # times in dataset | Proportions | # Missing value
:--------: | :--------: | :--------: | :--------:
Google LLC | 30 | PreK-12; Higher Ed; Corporate (85.19%)  PreK-12; Higher Ed (7.41%) PreK-12 (7.41%) | 3  

In [None]:
# What is the dominant sector?
products[(products['Provider/Company Name']=='Google LLC')]['Sector(s)'].value_counts(normalize=True)
# What are the completed and missing values?
products[(products['Provider/Company Name']=='Google LLC')]['Sector(s)'].notnull().shape
# Substitution of the missing values only by the value of the predominant sector
# (the known sectors of the other Google LLC products are left untouched):
google_missing = (products['Provider/Company Name']=='Google LLC') & (products['Sector(s)'].isnull())
products.loc[google_missing,'Sector(s)'] = 'PreK-12; Higher Ed; Corporate'

* __2.6 Adobe Inc.:__
    
Company | # times in dataset | Proportions | # Missing value
:--------: | :--------: | :--------: | :--------:
Adobe Inc. | 3 | PreK-12; Higher Ed; Corporate (100%) | 1 

In [None]:
products[(products['Provider/Company Name']=='Adobe Inc.')]['Sector(s)'].value_counts(normalize=True)
products[(products['Provider/Company Name']=='Adobe Inc.')]['Sector(s)'].isnull()
d = products[products['Provider/Company Name']=='Adobe Inc.'].reset_index()
for i in d['index']:
    products.loc[i,'Sector(s)'] = 'PreK-12; Higher Ed; Corporate'

* __2.7 Grammarly:__
    
Company | # times in dataset | Proportions | # Missing value
:--------: | :--------: | :--------: | :--------:
Grammarly | 2 | PreK-12; Higher Ed; Corporate (100%) | 1  

In [None]:
products[(products['Provider/Company Name']=='Grammarly')]['Sector(s)'].value_counts(normalize=True)
products[(products['Provider/Company Name']=='Grammarly')]['Sector(s)'].isnull()
products.loc[314, 'Sector(s)'] = 'PreK-12; Higher Ed; Corporate'

* __2.8 Technological Solutions, Inc. (TSI):__
   
Company | # times in dataset | Proportions | # Missing value
:--------: | :--------: | :--------: | :--------:
TSI | 2 | PreK-12 (100%) | 1 

In [None]:
products[(products['Provider/Company Name']=='Technological Solutions, Inc. (TSI)')]['Sector(s)'].value_counts(normalize=True)
products[(products['Provider/Company Name']=='Technological Solutions, Inc. (TSI)')]['Sector(s)'].isnull()
products.loc[352, 'Sector(s)'] = 'PreK-12'

* __2.9 Code.org:__
  
Company | # times in dataset | Proportions | # Missing value
:--------: | :--------: | :--------: | :--------:
Code.org | 2 | PreK-12 (100%) | 1 

In [None]:
products[(products['Provider/Company Name']=='Code.org')]['Sector(s)'].value_counts(normalize=True)
products[(products['Provider/Company Name']=='Code.org')]['Sector(s)'].isnull()
products.loc[356, 'Sector(s)'] = 'PreK-12'  

* __2.10 EDpuzzle Inc.:__
  
Company | # times in dataset | Proportions | # Missing value
:--------: | :--------: | :--------: | :--------:
EDpuzzle Inc. | 2 | PreK-12 (100%) | 1 

In [None]:
products[(products['Provider/Company Name']=='EDpuzzle Inc.')]['Sector(s)'].value_counts(normalize=True)
products[(products['Provider/Company Name']=='EDpuzzle Inc.')]['Sector(s)'].isnull()
products.loc[370, 'Sector(s)'] = 'PreK-12'  
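
The per-company steps above (find the predominant value, then write it into the missing rows) can be sketched as a single group-wise operation. This is only an illustration under stated assumptions: `fill_with_provider_mode` is a hypothetical helper, the toy rows are made up, and ties between equally frequent values are broken by taking the first value returned by `mode()`, which the notebook does not specify.

```python
import numpy as np
import pandas as pd

# Toy stand-in (hypothetical rows).
toy = pd.DataFrame({
    "Provider/Company Name": ["Acme", "Acme", "Acme", "Beta", "Beta"],
    "Sector(s)": ["PreK-12", "PreK-12", np.nan, np.nan, "Corporate"],
})

def fill_with_provider_mode(df, column, by="Provider/Company Name"):
    """Fill NaN in `column` with the most frequent value of the same provider."""
    def _fill(group):
        modes = group.mode()  # mode() ignores NaN by default
        # Providers with no known value at all are left unchanged.
        return group.fillna(modes.iloc[0]) if not modes.empty else group
    out = df.copy()
    out[column] = out.groupby(by)[column].transform(_fill)
    return out

filled = fill_with_provider_mode(toy, "Sector(s)")
print(filled)
```

Only the missing cells are written, so the known values of each provider stay as they are.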

__3. Primary essential function:__

To deal with the missing values of the Primary essential function variable we will use the same reasoning as in the previous section. That is, we will rely on the following two questions and one action:
- What is the proportion of the dominant value?
- How many times does the company appear in the dataset?
- Replace each NaN with the dominant Primary essential function of the corresponding company.

Summary: 
There are 13 missing values in Primary essential function. 

In [None]:
products[products['Primary Essential Function'].isnull()]

* __3.1 IXL Learning:__
  
Company | # times in dataset | Proportions | # Missing value
:--------: | :--------: | :--------: | :--------:
IXL Learning | 4 | LC - Digital Learning Platforms (100%) | 1 

In [None]:
products[products['Provider/Company Name']=='IXL Learning']['Primary Essential Function'].value_counts(normalize=True)
products[(products['Provider/Company Name']=='IXL Learning')]['Primary Essential Function'].isnull()
products.loc[61,'Primary Essential Function'] = 'LC - Digital Learning Platforms' 

* __3.2 Microsoft:__
   
Company | # times in dataset | Proportions | # Missing value
:--------: | :--------: | :--------: | :--------:
Microsoft | 6 | LC/CM/SDO - Other (50%) // LC - Sites, Resources & Reference - Games & Simulations (25%) // LC - Content Creation & Curation (25%)| 2 

In [None]:
products[products['Provider/Company Name']=='Microsoft']['Primary Essential Function'].value_counts(normalize=True)
products[(products['Provider/Company Name']=='Microsoft')]['Primary Essential Function'].isnull()
products.loc[183,'Primary Essential Function'] = 'LC/CM/SDO - Other' 
products.loc[293,'Primary Essential Function'] = 'LC/CM/SDO - Other' 

* __3.3 Houghton Mifflin Harcourt:__
    
Company | # times in dataset | Proportions | # Missing value
:--------: | :--------: | :--------: | :--------:
Houghton Mifflin Harcourt | 6 | LC - Courseware & Textbooks (60%) // LC - Study Tools (20%) // LC - Digital Learning Platforms (20%)| 1 

In [None]:
products[products['Provider/Company Name']=='Houghton Mifflin Harcourt']['Primary Essential Function'].value_counts(normalize=True)
products[(products['Provider/Company Name']=='Houghton Mifflin Harcourt')]['Primary Essential Function'].isnull()
products.loc[210,'Primary Essential Function'] = 'LC - Courseware & Textbooks'  

* __3.4 ClassDojo, Inc.:__
    
Company | # times in dataset | Proportions | # Missing value
:--------: | :--------: | :--------: | :--------:
ClassDojo, Inc. | 2 | CM - Classroom Engagement & Instruction - Communication & Messaging (100%) | 1 

In [None]:
products[products['Provider/Company Name']=='ClassDojo, Inc.']['Primary Essential Function'].value_counts(normalize=True)
products[(products['Provider/Company Name']=='ClassDojo, Inc.')]['Primary Essential Function'].isnull()
products.loc[237,'Primary Essential Function'] = 'CM - Classroom Engagement & Instruction - Communication & Messaging'  

* __3.5 Google LLC:__
   
Company | # times in dataset | Proportions | # Missing value
:--------: | :--------: | :--------: | :--------:
Google LLC | 30 | LC/CM/SDO - Other (22.22%) // LC - Content Creation & Curation (18.52%) // LC - Sites, Resources & Reference (11.11%) // CM - Virtual Classroom - Video Conferencing & Screen Sharing (7.41%) // CM - Classroom Engagement & Instruction - Communication & Messaging (7.41%) // LC - Sites, Resources & Reference - Digital Collection & Repository (7.41%) // CM - Classroom Engagement & Instruction - Assessment & Classroom Response (3.70%) // SDO - Data, Analytics & Reporting - Site Hosting & Data Warehousing (3.70%) // LC - Study Tools (3.70%) // LC - Sites, Resources & Reference - Streaming Services (3.70%) // SDO - Learning Management Systems (LMS) (3.70%) // LC - Sites, Resources & Reference - Encyclopedia (3.70%) // CM - Classroom Engagement & Instruction - Classroom Management (3.70%) | 3

In [None]:
products[products['Provider/Company Name']=='Google LLC']['Primary Essential Function'].value_counts(normalize=True)*100
products[(products['Provider/Company Name']=='Google LLC')]['Primary Essential Function'].isnull().shape
products.loc[248,'Primary Essential Function'] = 'LC/CM/SDO - Other'
products.loc[262,'Primary Essential Function'] = 'LC/CM/SDO - Other' 
products.loc[265,'Primary Essential Function'] = 'LC/CM/SDO - Other'

* __3.6 Adobe Inc.:__
    
Company | # times in dataset | Proportions | # Missing value
:--------: | :--------: | :--------: | :--------:
Adobe Inc. | 3 | LC - Content Creation & Curation (100%) | 1

In [None]:
products[products['Provider/Company Name']=='Adobe Inc.']['Primary Essential Function'].value_counts(normalize=True)
products[(products['Provider/Company Name']=='Adobe Inc.')]['Primary Essential Function'].isnull()
products.loc[305,'Primary Essential Function'] = 'LC - Content Creation & Curation'

* __3.7 Grammarly:__
    
Company | # times in dataset | Proportions | # Missing value
:--------: | :--------: | :--------: | :--------:
Grammarly | 2 | LC - Study Tools      (100%)| 1

In [None]:
products[products['Provider/Company Name']=='Grammarly']['Primary Essential Function'].value_counts(normalize=True)
products[(products['Provider/Company Name']=='Grammarly')]['Primary Essential Function'].isnull()
products.loc[314,'Primary Essential Function'] = 'LC - Study Tools' 

* __3.8 Technological Solutions, Inc. (TSI):__
    
Company | # times in dataset | Proportions | # Missing value
:--------: | :--------: | :--------: | :--------:
TSI | 2 | LC - Sites, Resources & Reference - Games & Simulations      (100%)| 1

In [None]:
products[products['Provider/Company Name']=='Technological Solutions, Inc. (TSI)']['Primary Essential Function'].value_counts(normalize=True)
products[(products['Provider/Company Name']=='Technological Solutions, Inc. (TSI)')]['Primary Essential Function'].isnull()
products.loc[352,'Primary Essential Function'] = 'LC - Sites, Resources & Reference - Games & Simulations'

* __3.9 Code.org:__
   
Company | # times in dataset | Proportions | # Missing value
:--------: | :--------: | :--------: | :--------:
Code.org | 2 | LC - Digital Learning Platforms      (100%)| 1

In [None]:
products[products['Provider/Company Name']=='Code.org']['Primary Essential Function'].value_counts(normalize=True)
products[(products['Provider/Company Name']=='Code.org')]['Primary Essential Function'].isnull()
products.loc[356,'Primary Essential Function'] = 'LC - Digital Learning Platforms'

* __3.10 EDpuzzle Inc.:__
    
Company | # times in dataset | Proportions | # Missing value
:--------: | :--------: | :--------: | :--------:
EDpuzzle Inc. | 2 | LC - Digital Learning Platforms      (100%)| 1

In [None]:
products[products['Provider/Company Name']=='EDpuzzle Inc.']['Primary Essential Function'].value_counts(normalize=True)
products[(products['Provider/Company Name']=='EDpuzzle Inc.')&(products['Primary Essential Function'].isnull())]
products.loc[370,'Primary Essential Function'] = 'LC - Digital Learning Platforms'

We now have the products dataset with no missing values. If we take a look at the description of the Primary essential function variable, we can see that it can be divided into `Categories` (pef_cat) and `Sub-Categories` (pef).

In [None]:
primary_essential_main = []
primary_essential_sub = []

for s in products["Primary Essential Function"]:
    if(not pd.isnull(s)):
        s1 = s.split("-",1)[0].strip()
        primary_essential_main.append(s1)
    else:
        primary_essential_main.append(np.nan)
    
    if(not pd.isnull(s)):
        s2 = s.split("-",1)[1].strip()
        primary_essential_sub.append(s2)
    else:
        primary_essential_sub.append(np.nan)


products["pef_cat"] = primary_essential_main
products["pef"] = primary_essential_sub
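
The same split can be written without an explicit loop by using the pandas string accessor. The sketch below assumes, as the loop above does, that the category is everything before the first hyphen; the toy rows are hypothetical examples of the label format.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the "Primary Essential Function" column (hypothetical rows).
toy = pd.DataFrame({"Primary Essential Function": [
    "LC - Digital Learning Platforms",
    "CM - Classroom Engagement & Instruction - Communication & Messaging",
    "LC/CM/SDO - Other",
    np.nan,
]})

# Split once on the first hyphen; expand=True returns one column per part.
parts = toy["Primary Essential Function"].str.split("-", n=1, expand=True)
toy["pef_cat"] = parts[0].str.strip()  # main category, e.g. "LC"
toy["pef"] = parts[1].str.strip()      # sub-category, may itself contain hyphens
print(toy[["pef_cat", "pef"]])
```

Missing values stay missing in both new columns, which matches the `np.nan` branches of the loop.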

### B. District:

The district file districts_info.csv includes information about the characteristics of school districts, including data from NCES (2018-19), FCC (Dec 2018), and Edunomics Lab. In this data set, we removed the identifiable information about the school districts.
It has 233 rows and 7 columns. 
We will now proceed in the same way as for the products_info.csv dataset.  

Variable | Description | Missing value | Dtype
:--------: | ------- | :--------: | -------
district_id | The unique identifier of the school district. | 0 | int64
state | The state in which the district is located. | 57 | object
locale | NCES locale classification that categorizes U.S. territory into four types of areas: City, Suburban, Town, and Rural. See Locale Boundaries User's Manual for more information. | 57 | object
pct_black/hispanic | Percentage of students in the districts identified as Black or Hispanic based on 2018-19 NCES data. | 85 | object
pct_free/reduced | Percentage of students in the districts eligible for free or reduced-price lunch based on 2018-19 NCES data. | 71 | object
county_connections_ratio | Ratio (residential fixed high-speed connections over 200 kbps in at least one direction/households) based on the county-level data from FCC Form 477 (December 2018 version). See FCC data for more information. | 71 | object
pp_total_raw | Per-pupil total expenditure (sum of local and federal expenditure) from Edunomics Lab's National Education Resource Database on Schools (NERD$) project. The expenditure data are school-by-school, and we use the median value to represent the expenditure of a given school district. | 115 | object

In [None]:
dis = pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv')

In [None]:
print('_'*50)
print('General information about districts_info.csv')
print()
print('Shape: ',dis.shape)
print()
print('Information: ')
dis.info()
print()
print('Unique State: ', dis['state'].unique())
print('_'*50)

* __Treatment for missing values:__

__1. pp_total_raw variable:__

Next, we will fill in the missing values of the pp_total_raw variable. To do this, we went to the NERD$ website, which requires registration, to obtain the economic information for each state. Because the information for the state of New Hampshire is not current, we have decided to remove this state. We have built our own dataset with the average pp_total_raw for each state, the number of schools and the poverty rate.  

In [None]:
print('pp_total_raw = NaN for each state: \n\n', list(dis[(dis['pp_total_raw'].isnull()) &\
                                                                     (dis['state'].notnull())]['state'].unique()))

* __1.1 Connecticut:__
    
    
State | NERD: pp_total_raw (mean) dollars | pp_total_raw dollars
:--------: | :-------: | :--------: 
Connecticut | 17375 | [16000, 18000[

In [None]:
a = dis[dis['state']=='Connecticut'].reset_index()
for i in a['index']:
    dis.loc[i,'pp_total_raw'] = '[16000, 18000['

* __1.2 Ohio:__
    + Summary: 
    
State | NERD: pp_total_raw (mean) dollars | pp_total_raw dollars
:--------: | :-------: | :--------: 
Ohio | 9015 | [8000, 10000[

In [None]:
b = dis[dis['state']=='Ohio'].reset_index()
for i in b['index']:
    dis.loc[i,'pp_total_raw'] = '[8000, 10000['

* __1.3 California:__
    + Summary: 
    
State | NERD: pp_total_raw (mean) dollars | pp_total_raw dollars
:--------: | :-------: | :--------: 
California | 12460 | [12000, 14000[

In [None]:
c = dis[dis['state']=='California'].reset_index()
for i in c['index']:
    dis.loc[i,'pp_total_raw'] = '[12000, 14000['

* __1.4 Arizona:__
    + Summary: 
    
State | NERD: pp_total_raw (mean) dollars | pp_total_raw dollars
:--------: | :-------: | :--------: 
Arizona | 8658 | [8000, 10000[

In [None]:
d = dis[dis['state']=='Arizona'].reset_index()
for i in d['index']:
    dis.loc[i,'pp_total_raw'] = '[8000, 10000['

* __1.5 North Dakota:__
    + Summary: 
    
State | NERD: pp_total_raw (mean) dollars | pp_total_raw dollars
:--------: | :-------: | :--------: 
North Dakota | 12255 | [12000, 14000[

In [None]:
e = dis[dis['state']=='North Dakota'].reset_index()
for i in e['index']:
    dis.loc[i,'pp_total_raw'] = '[12000, 14000['

* __1.6 New Hampshire:__

We drop this state because its NERD$ information is not current.

In [None]:
# Drop the New Hampshire rows
dis = dis.drop([202,217],axis=0)

* __1.7 New York:__
    + Summary: 
    
State | NERD: pp_total_raw (mean) dollars | pp_total_raw dollars
:--------: | :-------: | :--------: 
New York | 22063 | [22000, 24000[

In [None]:
f = dis[dis['state']=='New York'].reset_index()
for i in f['index']:
    dis.loc[i,'pp_total_raw'] = '[22000, 24000['
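
Each state mean quoted in the tables above falls into a 2000-dollar interval written in the dataset's `[lower, upper[` notation. As a hedged sketch, the interval label can be derived from the mean and assigned for all states in one step. `expenditure_bin`, the 2000-dollar width and the toy frame are assumptions made for illustration; like the cells above, the sketch overwrites every row of the listed states.

```python
import pandas as pd

def expenditure_bin(mean_value, width=2000):
    """Return the half-open interval label, e.g. 17375 -> '[16000, 18000['."""
    lower = int(mean_value // width) * width
    return f"[{lower}, {lower + width}["

# State means quoted in the tables above (NERD$).
nerd_means = {
    "Connecticut": 17375, "Ohio": 9015, "California": 12460,
    "Arizona": 8658, "North Dakota": 12255, "New York": 22063,
}
labels = {state: expenditure_bin(value) for state, value in nerd_means.items()}

# Toy stand-in for districts_info.csv (hypothetical rows).
toy = pd.DataFrame({"state": ["Ohio", "New York", "Utah"],
                    "pp_total_raw": [None, None, "[6000, 8000["]})

# States in `labels` get their interval; other states keep their value.
mapped = toy["state"].map(labels)
toy["pp_total_raw"] = mapped.fillna(toy["pp_total_raw"])
print(toy)
```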

__2. pct_free/reduced & pct_black/hispanic variables:__

At first glance, by definition of the variables, it would appear that there is a relationship between them. We do not yet know what it might be. Let us first see, in a generic way, the proportions of each variable depending on the locale where they are located. 

In [None]:
#for i in list(['City','Suburb','Town','Rural']):
#    for j in list(['[0, 0.2[', '[0.2, 0.4[', '[0.4, 0.6[', '[0.6, 8[', '[0.8, 1[']):
print(' '*30, 'Pie chart: locale is City:\n')
plt.figtext(0.70, 0.55, 'Case City: Conclusion', fontsize = 15, fontname = 'monospace', color = '#111112')
plt.subplot(2,3,1)
#_____________________________________CITY + 'pct_black/hispanic'=='[0, 0.2['
dis[(dis['locale']=='City') & (dis['pct_black/hispanic']=='[0, 0.2[')]['pct_free/reduced']\
        .value_counts().plot(kind = 'pie',
                             autopct='%1.1f%%',
                             figsize=(12, 12),
                             colors =("#c9061b","#6e6c70","#6e6c70"))

plt.title('Pie chart black/hispanic [0,0.2[')


plt.subplot(2,3,2)
#_____________________________________CITY + 'pct_black/hispanic'=='[0.2, 0.4['
dis[(dis['locale']=='City') & (dis['pct_black/hispanic']=='[0.2, 0.4[')]['pct_free/reduced']\
        .value_counts().plot(kind = 'pie',
                             autopct='%1.1f%%',
                             figsize=(12, 12),
                             colors =("#c9061b","#6e6c70","#6e6c70"))

plt.title('Pie chart black/hispanic [0.2,0.4[')
plt.subplot(2,3,3)
#_____________________________________CITY + 'pct_black/hispanic'=='[0.4, 0.6['
dis[(dis['locale']=='City') & (dis['pct_black/hispanic']=='[0.4, 0.6[')]['pct_free/reduced']\
        .value_counts().plot(kind = 'pie',
                             autopct='%1.1f%%',
                             figsize=(12, 12),
                             colors =("#c9061b","#6e6c70","#6e6c70"))

plt.title('Pie chart black/hispanic [0.4,0.6[')

plt.subplot(2,3,4)
#_____________________________________CITY + 'pct_black/hispanic'=='[0.6, 0.8['
dis[(dis['locale']=='City') & (dis['pct_black/hispanic']=='[0.6, 0.8[')]['pct_free/reduced']\
        .value_counts().plot(kind = 'pie',
                             autopct='%1.1f%%',
                             figsize=(12, 12),
                             colors =("#c9061b","#6e6c70","#6e6c70"))

plt.title('Pie chart black/hispanic [0.6,0.8[')

plt.subplot(2,3,5)
#_____________________________________CITY + 'pct_black/hispanic'=='[0.8, 1['
dis[(dis['locale']=='City') & (dis['pct_black/hispanic']=='[0.8, 1[')]['pct_free/reduced']\
        .value_counts().plot(kind = 'pie',
                             autopct='%1.1f%%',
                             figsize=(12, 12),
                             colors =("#c9061b","#6e6c70","#6e6c70"))

plt.title('Pie chart black/hispanic [0.8, 1[')

#plt.subplot(2,3,6)
#plt.figtext(0.74, 0.42, 'Conclusion', fontsize = 15, fontname = 'monospace', color = '#111112')

plt.figtext(0.70, 0.20, '''* When pct_black/hispanic range is 
[0,0.2[, the most common value 
(60%) of pct_free/reduced is [0,0.2[.

* When pct_black/hispanic range is 
[0.2,0.4[, the most common value (60%) 
of pct_free/reduced is [0.4,0.6[.

* When pct_black/hispanic range is 
[0.4,0.6[, the most common value (50%) 
of pct_free/reduced is [0.4,0.6[.

* When pct_black/hispanic range is 
[0.6,0.8[, two values of pct_free/reduced 
are equally common. 

* When pct_black/hispanic range is 
[0.8,1[, the most common value (66.7%) 
of pct_free/reduced is [0.8,1[.''', fontsize = 13, 
            fontname = 'monospace', color = '#111112', ha = 'left')

plt.show()

In [None]:
print(' '*30, 'Pie chart: locale is Suburb:\n')
plt.figtext(0.70, 0.55, 'Case Suburb: Conclusion', fontsize = 15, fontname = 'monospace', color = '#111112')
plt.subplot(2,3,1)
dis[(dis['locale']=='Suburb') & (dis['pct_black/hispanic']=='[0, 0.2[')]['pct_free/reduced']\
        .value_counts().plot(kind = 'pie',
                             autopct='%1.1f%%',
                             figsize=(12, 12),
                             colors =("#c9061b","#6e6c70","#6e6c70"))

plt.title('Pie chart black/hispanic [0,0.2[')


plt.subplot(2,3,2)
dis[(dis['locale']=='Suburb') & (dis['pct_black/hispanic']=='[0.2, 0.4[')]['pct_free/reduced']\
        .value_counts().plot(kind = 'pie',
                             autopct='%1.1f%%',
                             figsize=(12, 12),
                             colors =("#c9061b","#6e6c70","#6e6c70"))

plt.title('Pie chart black/hispanic [0.2,0.4[')

plt.subplot(2,3,3)
dis[(dis['locale']=='Suburb') & (dis['pct_black/hispanic']=='[0.4, 0.6[')]['pct_free/reduced']\
        .value_counts().plot(kind = 'pie',
                             autopct='%1.1f%%',
                             figsize=(12, 12),
                             colors =("#c9061b","#6e6c70","#6e6c70"))

plt.title('Pie chart black/hispanic [0.4,0.6[')

plt.subplot(2,3,4)
dis[(dis['locale']=='Suburb') & (dis['pct_black/hispanic']=='[0.6, 0.8[')]['pct_free/reduced']\
        .value_counts().plot(kind = 'pie',
                             autopct='%1.1f%%',
                             figsize=(12, 12),
                             colors =("#c9061b","#c9061b","#6e6c70"))

plt.title('Pie chart black/hispanic [0.6,0.8[')

plt.subplot(2,3,5)
dis[(dis['locale']=='Suburb') & (dis['pct_black/hispanic']=='[0.8, 1[')]['pct_free/reduced']\
        .value_counts().plot(kind = 'pie',
                             autopct='%1.1f%%',
                             figsize=(12, 12),
                             colors =("#c9061b","#c9061b","#6e6c70"))

plt.title('Pie chart black/hispanic [0.8, 1[')
#plt.subplot(2,3,6)
#plt.figtext(0.74, 0.42, 'Conclusion', fontsize = 15, fontname = 'monospace', color = '#111112')

plt.figtext(0.70, 0.20, '''* When pct_black/hispanic range is 
[0,0.2[, the most common value 
(50.0%) of pct_free/reduced is [0,0.2[.

* When pct_black/hispanic range is 
[0.2,0.4[, the most common value (73.3%) 
of pct_free/reduced is [0.2,0.4[.

* When pct_black/hispanic range is 
[0.4,0.6[, the most common value (55.6%) 
of pct_free/reduced is [0.4,0.6[.

* When pct_black/hispanic range is 
[0.6,0.8[, two values of pct_free/reduced 
are equally common: [0.2,0.4[ and [0.6,0.8[. 

* When pct_black/hispanic range is 
[0.8,1[, two values of pct_free/reduced 
are equally common: [0.6,0.8[ and [0.8,1[.''', fontsize = 13, 
            fontname = 'monospace', color = '#111112', ha = 'left')

plt.show()

In [None]:
print(' '*30, 'Pie chart: locale is Town:\n')
#plt.figtext(0.70, 0.55, 'Case Town: Conclusion', fontsize = 15, fontname = 'monospace', color = '#111112')
plt.subplot(1,2,1)
dis[(dis['locale']=='Town') & (dis['pct_black/hispanic']=='[0, 0.2[')]['pct_free/reduced']\
        .value_counts().plot(kind = 'pie',
                             autopct='%1.1f%%',
                             figsize=(12, 12),
                             colors =("#c9061b","#6e6c70","#6e6c70"))

plt.title('Pie chart Town black/hispanic [0,0.2[')


plt.subplot(1,2,2)
dis[(dis['locale']=='Town') & (dis['pct_black/hispanic']=='[0.2, 0.4[')]['pct_free/reduced']\
        .value_counts().plot(kind = 'pie',
                             autopct='%1.1f%%',
                             figsize=(12, 12),
                             colors =("#c9061b","#6e6c70","#6e6c70"))

plt.title('Pie chart Town black/hispanic [0.2,0.4[')
print('Case Town: When pct_black/hispanic range is [0, 0.2[, the most common value (66.7%) of pct_free/reduced is [0.4, 0.6[.') 
print('When pct_black/hispanic range is [0.2, 0.4[, the most common value (100%) of pct_free/reduced is [0.8, 1[.') 

plt.show()
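The pie charts above are read one by one to find the most common `pct_free/reduced` range for each combination of `locale` and `pct_black/hispanic`. As a minimal sketch, the same information could be tabulated with a `groupby`. The `toy` table below is a hypothetical miniature of `dis` with made-up values, not the real data.

```python
import pandas as pd

# Hypothetical miniature of the `dis` table; the values are illustrative only.
toy = pd.DataFrame({
    "locale": ["City", "City", "City", "Suburb", "Suburb", "Suburb"],
    "pct_black/hispanic": ["[0, 0.2[", "[0, 0.2[", "[0, 0.2[",
                           "[0.2, 0.4[", "[0.2, 0.4[", "[0.2, 0.4["],
    "pct_free/reduced": ["[0, 0.2[", "[0, 0.2[", "[0.2, 0.4[",
                         "[0.2, 0.4[", "[0.4, 0.6[", "[0.2, 0.4["],
})

groups = toy.groupby(["locale", "pct_black/hispanic"])["pct_free/reduced"]

# Share of each pct_free/reduced range inside every group (the pie slices).
shares = groups.value_counts(normalize=True)

# Most frequent range per group (the largest slice of each pie).
# With ties, mode() returns several values and only the first is kept.
most_common = groups.agg(lambda s: s.mode().iloc[0])

print(shares)
print(most_common)
```

A table like `most_common` makes ties explicit, which is where the pie charts are hardest to read.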

With all this general information, let's analyse which states have pct_free/reduced as missing values.

In [None]:
print('pct_free/reduced missing value states are: \n\n', list(dis[(dis['pct_free/reduced'].isnull()) &\
                                                                              (dis['state'].notnull())]['state'].unique()))

* __2.1 Massachusetts:__
    + Summary: 
    
State | locale | pct_black/hispanic | pct_free/reduced 
:--------: | :-------: | :-------: | :-------: 
Massachusetts | Suburb | [0, 0.2[ | [0, 0.2[
Massachusetts | Rural | [0, 0.2[ | [0.4, 0.6[
Massachusetts | City | [0, 0.2[ | [0, 0.2[
Massachusetts | Suburb | [0.2, 0.4[ | [0.2, 0.4[
Massachusetts | Suburb | [0.4, 0.6[ | [0.4, 0.6[

In [None]:
a = dis[(dis['state']=='Massachusetts') & (dis['locale']=='Suburb') & (dis['pct_black/hispanic']=='[0, 0.2[')].reset_index()
for i in a['index']:
    dis.loc[i,'pct_free/reduced'] = '[0, 0.2['
b = dis[(dis['state']=='Massachusetts')  & (dis['locale']=='Rural') & (dis['pct_black/hispanic']=='[0, 0.2[')].reset_index()
for i in b['index']:
    dis.loc[i,'pct_free/reduced'] = '[0.4, 0.6['
c = dis[(dis['state']=='Massachusetts') & (dis['locale']=='City')  & (dis['pct_black/hispanic']=='[0, 0.2[')].reset_index()
for i in c['index']:
    dis.loc[i,'pct_free/reduced'] = '[0, 0.2['
d = dis[(dis['state']=='Massachusetts') & (dis['locale']=='Suburb') & (dis['pct_black/hispanic']=='[0.2, 0.4[')].reset_index()
for i in d['index']:
    dis.loc[i,'pct_free/reduced'] = '[0.2, 0.4['
e = dis[(dis['state']=='Massachusetts') & (dis['locale']=='Suburb') & (dis['pct_black/hispanic']=='[0.4, 0.6[')].reset_index()
for i in e['index']:
    dis.loc[i,'pct_free/reduced'] = '[0.4, 0.6['
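The loops above assign one value per group by iterating over row labels. An alternative sketch uses a boolean mask with `.loc`. Note one deliberate difference: the hypothetical helper `fill_free_reduced` below only fills cells that are missing, whereas the loops overwrite every row of the group. The `toy` table is made up for illustration.

```python
import numpy as np
import pandas as pd

def fill_free_reduced(df, state, locale, black_hispanic, value):
    """Hypothetical helper: fill only the missing pct_free/reduced cells of one group."""
    mask = (
        (df["state"] == state)
        & (df["locale"] == locale)
        & (df["pct_black/hispanic"] == black_hispanic)
        & df["pct_free/reduced"].isna()
    )
    df.loc[mask, "pct_free/reduced"] = value
    return int(mask.sum())  # number of cells that were filled

# Made-up rows that mimic the structure of `dis`.
toy = pd.DataFrame({
    "state": ["Massachusetts", "Massachusetts", "Ohio"],
    "locale": ["Suburb", "Suburb", "City"],
    "pct_black/hispanic": ["[0, 0.2[", "[0, 0.2[", "[0.4, 0.6["],
    "pct_free/reduced": [np.nan, "[0.2, 0.4[", np.nan],
})

n_filled = fill_free_reduced(toy, "Massachusetts", "Suburb", "[0, 0.2[", "[0, 0.2[")
print(n_filled)
print(toy)
```

Returning the number of filled cells gives a quick check that each rule in the summary table matched the rows it was meant for.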

* __2.2 Ohio:__
    + Summary:  
    
State | locale | pct_black/hispanic | pct_free/reduced 
:--------: | :-------: | :-------: | :-------: 
Ohio | City | [0.4, 0.6[ | [0.4, 0.6[

In [None]:
f = dis[(dis['state']=='Ohio') & (dis['locale']=='City') & (dis['pct_black/hispanic']=='[0.4, 0.6[')].reset_index()
for i in f['index']:
    dis.loc[i,'pct_free/reduced'] = '[0.4, 0.6['

* __2.3 Arizona:__
    + Summary:  
    
State | locale |  pct_free/reduced 
:--------: | :-------: | :-------:
Arizona | City | [0.8, 1[

(Remember that Arizona only has one school).

In [None]:
g = dis[(dis['state']=='Arizona') & (dis['locale']=='City')].reset_index()
for i in g['index']:
    dis.loc[i,'pct_free/reduced'] = '[0.8, 1['

* __2.4 Tennessee:__
    + Summary:  
    
State | locale | pct_black/hispanic | pct_free/reduced 
:--------: | :-------: | :-------: | :-------: 
Tennessee | Suburb | [0, 0.2[ | [0, 0.2[
Tennessee | Rural | [0.2, 0.4[ | [0.8, 1[

In [None]:
h = dis[(dis['state']=='Tennessee') & (dis['locale']=='Suburb') & (dis['pct_black/hispanic']=='[0, 0.2[')].reset_index()
for i in h['index']:
    dis.loc[i,'pct_free/reduced'] = '[0, 0.2['
ii = dis[(dis['state']=='Tennessee') & (dis['locale']=='Rural') & (dis['pct_black/hispanic']=='[0.2, 0.4[')].reset_index()
for i in ii['index']:
    dis.loc[i,'pct_free/reduced'] = '[0.8, 1['
    
dis.loc[36,'pct_free/reduced'] = '[0.8, 1['
dis.loc[50,'pct_free/reduced'] = '[0.4, 0.6['
dis.loc[149,'pct_free/reduced'] = '[0.8, 1['

__3. county_connections_ratio variable:__

This variable is very interesting. It represents the number of fixed residential broadband connections in a territory divided by the number of households. A value greater than one would mean that there are more connections than households, that is, that some households have two connections, which could very well be the case. However, the majority of states in the selected sample have a value lower than one, which implies that not all households have an internet connection. This categorical variable takes two possible values: [0.18,1[ and [1,2[. I found these ranges too coarse, so, searching through the websites recommended by the contest sponsors, I found tables for each state from which the index can be derived. This variable matters because connectivity is what allows our students to participate in online learning and not be left behind. 

Before proceeding, let's replace the missing NaN values with the predominant value, [0.18,1[. Then we will delete all rows that have fewer than 5 non-missing values. We tried to find the district_id values in the tables given by the sponsors' websites, but it was not possible to find out which district_id actually belongs to which locality. An interesting fact that we will see later is that I was able to identify the school in North Dakota. Do you want to see how? 
Now we'll focus on replacing the NaN values with [0.18,1[.

In [None]:
dis['county_connections_ratio'].value_counts(normalize=True).plot(kind='barh')
plt.title('County connections ratio', size=15)
plt.ylabel('Options')
plt.show()

In [None]:
dis['county_connections_ratio'] = dis['county_connections_ratio'].replace(np.nan, '[0.18, 1[')

In [None]:
dis = dis.dropna(thresh=5)
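The `thresh` argument of `dropna` is easy to misread: it is the minimum number of non-missing values a row needs in order to be kept, not the number of missing values that triggers a deletion. The sketch below illustrates this on a made-up table; the seven column names are an assumption about the layout of `dis`.

```python
import numpy as np
import pandas as pd

# Made-up rows; the seven columns are assumed to mirror `dis`.
toy = pd.DataFrame({
    "district_id": [1, 2, 3],
    "state": ["Utah", np.nan, np.nan],
    "locale": ["City", np.nan, "Rural"],
    "pct_black/hispanic": ["[0, 0.2[", np.nan, "[0, 0.2["],
    "pct_free/reduced": [np.nan, np.nan, "[0.2, 0.4["],
    "county_connections_ratio": ["[0.18, 1[", "[0.18, 1[", "[0.18, 1["],
    "pp_total_raw": [np.nan, np.nan, "[8000, 10000["],
})

# Number of non-missing values in each row.
non_null = toy.notna().sum(axis=1)

# Keep only the rows with at least 5 non-missing values.
kept = toy.dropna(thresh=5)

print(non_null.tolist())
print(kept["district_id"].tolist())
```

Here the second row, which only has a `district_id` and the filled-in connection ratio, is the one that gets dropped.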

Are all variables without missing values?

In [None]:
dis.info()

In [None]:
np.where(dis['pct_free/reduced'].isnull())

In [None]:
dis = dis.reset_index()
dis.drop(['index'], axis=1, inplace=True)
dis.loc[36,'pct_free/reduced'] = '[0.8, 1['
dis.loc[50,'pct_free/reduced'] = '[0.4, 0.6['
dis.loc[149,'pct_free/reduced'] = '[0.8, 1['

__4. Summary after treating missing values:__

In [None]:
def func_plot(df,col1):
    fig,ax = plt.subplots(1,2,figsize=(14,5))
    sns.countplot(data=df,y=col1,ax=ax[0],order=df[col1].value_counts().index,orient="v")
    ax[0].set_title(col1)
    ax[1].pie(x=df[col1].value_counts(),labels=df[col1].value_counts().index,autopct='%1.0f%%')
    ax[1].set_title(col1)

func_plot(dis,"pct_black/hispanic")
func_plot(dis,"pct_free/reduced")
func_plot(dis,"county_connections_ratio") 
func_plot(dis,"pp_total_raw") 

Now let's build a DataFrame that holds the broadband connections per household in each state. I only consider the permanent residents in each state. (This new DataFrame is taken from the analysis in _Annex I._) The table has the following variables: 

Variable | Description
:--------: | :-------: 
State | State name
25% | The 0.25 quantile (first quartile) of people using broadband in each state
50% | The 0.50 quantile (median) of people using broadband in each state
75% | The 0.75 quantile (third quartile) of people using broadband in each state
range | The range that contains the 50% value.

Interestingly, seeing that the contest sponsors have determined that all states except North Dakota have a ratio between [0.18,1[, we have eliminated all connections greater than unity. 
Another point to note is that we are going to choose the variable 50% as the numerical value of connectivity/households in each state.

In [None]:
conec_2 = pd.DataFrame()
conec_2[['state','25%','50%','75%','range']] = None
state = ['Illinois', 'Utah', 'Wisconsin', 'North Carolina', 'Missouri',
       'Washington', 'Connecticut', 'Massachusetts', 'New York',
       'Indiana', 'Virginia', 'Ohio', 'New Jersey', 'California',
       'District Of Columbia', 'Minnesota', 'Arizona', 'Texas',
       'Tennessee', 'Florida', 'North Dakota', 'Michigan']
first = [0.60,0.78,0.73,0.64,0.57,0.72,0.85,0.83,0.73,0.62,0.59,0.67,0.88,0.79,0.69,0.71,0.72,0.59,0.60,0.66,1.03,0.66]
segund = [0.69,0.86,0.77,0.76,0.65,0.83,0.87,0.88,0.78,0.69,0.67,0.74,0.94,0.86,0.73,0.75,0.78,0.69,0.67,0.85,1.05,0.73]
third = [0.76,0.89,0.83,0.85,0.74,0.89,0.89,0.92,0.86,0.76,0.77,0.79,0.95,0.92,0.88,0.80,0.85,0.78,0.75,0.94,1.07,0.79]
rango  = ['[0.6, 0.8[','[0.8, 1[','[0.7, 0.9[','[0.7, 0.9[','[0.6, 0.8[','[0.7, 0.9[','[0.8, 1[','[0.8, 1[','[0.7, 0.9[',
          '[0.6, 0.8[','[0.6, 0.8[','[0.6, 0.8[','[0.8, 1[','[0.8, 1[','[0.6, 0.8[','[0.7, 0.9[','[0.7, 0.9[','[0.6, 0.8[',
          '[0.6, 0.8[','[0.7, 0.9[','[1, 2[','[0.6, 0.8[']

conec_2['state'] = state
conec_2['25%'] = first
conec_2['50%'] = segund
conec_2['75%'] = third
conec_2['range'] = rango
        
conec_2.head(2)

In [None]:
conec_2['range'].value_counts(normalize=True).plot(kind='pie',autopct='%1.1f%%',
                                                   figsize=(6, 6),
                                                   colors =("#c9061b","#6e6c70","#6e6c70","#6e6c70"))
plt.title('Range of consumer/hhs', fontsize=15)
plt.figtext(0.95, 0.50, '''40.9% of the states have a connection ratio in the range [0.6, 0.8[. 
After that, 31.8% have a connection ratio in the range [0.7, 0.9[''', fontsize = 12, 
            fontname = 'monospace', color = '#111112', ha = 'left')
plt.show()

In [None]:
fig= plt.figure(figsize=(15,5))
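# Pass x and y as keywords: recent seaborn releases no longer accept them positionally
sns.lineplot(x=conec_2['state'], y=conec_2['50%'], color = '#c9061b')
sns.lineplot(x=conec_2['state'], y=conec_2['25%'], color = '#676A6C')
sns.lineplot(x=conec_2['state'], y=conec_2['75%'], color = '#0a0a0a')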
plt.ylabel('Ratio connections', fontsize=13)
plt.xlabel('States', fontsize=13)
plt.axvline('North Dakota',color = "red", linewidth = 1, linestyle = "dashed")
plt.xticks(rotation=60)
plt.legend(['50% ratio', '25% ratio','75% ratio'], ncol=3)
plt.title('Ratio connections for State', fontsize=15)
plt.show()

At this point, we are going to join the two dataframes.

In [None]:
dist = pd.merge( dis,conec_2, how='inner', on='state')
dist.info()

Earlier, we replaced the NaN values of the pp_total_raw variable with ranges of values. These were not chosen at random: they come from a careful study of the tables provided by the contest sponsors' websites. I have then created a new DataFrame with the following variables: 

Variable | Description
:--------: | :-------: 
state | State name
pp_total_raw_nerd | The average value of pp_total_raw given by NERD website.
pover | The poverty rate according to the 2018-2019 tables. Poverty is defined as a family of two adults and two children with an annual income of less than 24500 dollars. 
error_poverty | The standard variation of the poverty variable. 
#escuelas | The actual number of schools in each state.
Cardinal_points | Where each state is located according to the map of the United States. The options are: NorthEast, MidWest, South, West.
type_state | Indicates the political party governing that state in 2019-2020. There are two types: Republicans and Democrats. 

In [None]:
poverty2 = pd.DataFrame()
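# Column names must match the ones assigned below ('#escuelas'), otherwise an empty extra column is left behind
poverty2[['state','pp_total_raw_nerd','pover','error_poverty','#escuelas','Cardinal_points','type_state']] = None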
state = ['Illinois', 'Utah', 'Wisconsin', 'North Carolina', 'Missouri',
       'Washington', 'Connecticut', 'Massachusetts', 'New York',
       'Indiana', 'Virginia', 'Ohio', 'New Jersey', 'California',
       'District Of Columbia', 'Minnesota', 'Arizona', 'Texas',
       'Tennessee', 'Florida', 'North Dakota', 'Michigan']
pp_total_raw_nerd = [12152,8346,12042,10068,10273,14364,14375,16036,22063,11262,9885,9015,16618,12460,20579,
                     12588,8658,9645,10070,8525,12255,10750]
pover = [15.1,10.0,11.8,18.6,16.0,11.6,14.1,11.3,17.3,15.5,12.8,17.8,12.0,15.3,20.5,10.7,18.4,18.9,19.6,17.5,9.9,17.1]
error_poverty=[0.45,0.56,0.66,0.54,0.72,0.45,0.73,0.50,0.41,0.58,0.47,0.53,0.45,0.27,2.39,0.74,0.69,0.32,0.67,0.45,1.63,0.62]
schools = [3872,1797,2262,2646,2161,2384,1001,1845,4687,1797,3427,1861,2473,10087,231,2362,2029,8641,1752,3611,476,3382]
cardinalPoints  = ['Midwest','West','Midwest','South','Midwest','West','Northeast','Northeast','Northeast','Midwest','South',
         'Midwest','Northeast','West','South','Midwest','West','South','South','South','Midwest','Midwest']
types_governor = ['D','R','D','D','R','D','D','R','D','R','D','R','D','D','D','D','R','R','R','R','R','D']
        
poverty2['state'] = state
poverty2['pp_total_raw_nerd'] = pp_total_raw_nerd
poverty2['pover'] = pover
poverty2['error_poverty'] = error_poverty
poverty2['#escuelas'] = schools
poverty2['Cardinal_points'] = cardinalPoints
poverty2['type_state'] = types_governor
poverty2.info()

**Is the distribution of the location of the states homogeneous?**

Response: No. The Midwest is the most represented region (36.4% of the states).

In [None]:
poverty2['Cardinal_points'].value_counts().plot(kind='pie', autopct='%1.1f%%',
                                                   figsize=(12, 12),
                                                   colors =("#c9061b","#6e6c70","#6e6c70","#6e6c70"))
plt.show()

**Which state has the highest poverty rate?**

Response: District of Columbia (20.5%)

In [None]:
fig, ax = plt.subplots(figsize=(18,12))
ax.errorbar(x = poverty2['state'], y = poverty2['pover'], yerr = poverty2['error_poverty'], marker = 'o', ecolor = 'red')
# Rotate the state labels without overriding the tick positions
ax.tick_params(axis='x', labelrotation=90)
ax.set_title('Poverty rate by state (%)', fontsize=15)
plt.show()

### C. Engagement:

The extension collects page load events of over 10K education technology products in our product library, including websites, apps, web apps, software programs, extensions, ebooks, hardware, and services used in educational institutions. The engagement data have been aggregated at school district level, and each file represents data from one school district.

Variable | Description | Missing value | Dtype
:--------: | ------- | :--------: | -------
time | date in "YYYY-MM-DD" | 0 | object
lp_id | The unique identifier of the product. | 541 | float64
pct_access | Percentage of students in the district who have at least one page-load event of a given product on a given day. | 13447 | float
engagement_index | Total page-load events per one thousand students of a given product on a given day | 5378409 | float
district_id | Added in this notebook: the district identifier taken from each file name | 0 | object

In [None]:
path = '../input/learnplatform-covid19-impact-on-digital-learning/engagement_data' 
files = glob.glob(path + "/*.csv")

csv_list = []

for filename in files:
    df = pd.read_csv(filename, index_col=None, header=0)
    # Take the file name (last path component) without its extension, whatever the folder depth
    district_id = filename.split("/")[-1].split(".")[0]
    df["district_id"] = district_id
    csv_list.append(df)
    
engagement_data = pd.concat(csv_list)
engagement_data = engagement_data.reset_index(drop=True)
engagement_data.head()

In [None]:
print('_'*50)
print('General information about engagement.csv')
print()
print('Shape: ',engagement_data.shape)
print()
print('Information: ')
engagement_data.info()
print()
print('_'*50)

In [None]:
eng = engagement_data.copy()

* __Treatment for missing values:__

We see that there are many missing values in three of the five columns. To keep only complete records, we delete every row that has a NaN in any of these three columns (lp_id, pct_access and engagement_index). 


In [None]:
# Drop every row with a missing value in any of these three columns
eng = eng.dropna(subset=['lp_id', 'pct_access', 'engagement_index'])
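Before dropping rows it can help to measure how much data the filter removes. The following is a minimal sketch on a made-up table, `toy_eng`, that mimics the columns of the engagement files; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up engagement records with one missing value per row in rows 1 to 3.
toy_eng = pd.DataFrame({
    "time": ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"],
    "lp_id": [1.0, np.nan, 2.0, 3.0],
    "pct_access": [0.5, 0.1, np.nan, 0.0],
    "engagement_index": [12.0, 3.0, 4.0, np.nan],
    "district_id": ["1000", "1000", "1000", "1000"],
})

# Missing values per column.
missing_per_column = toy_eng.isna().sum()

# Drop a row as soon as one of the three columns is missing.
cleaned = toy_eng.dropna(subset=["lp_id", "pct_access", "engagement_index"])
rows_lost = len(toy_eng) - len(cleaned)

print(missing_per_column)
print(f"rows lost: {rows_lost} of {len(toy_eng)}")
```

In this toy case each column has a single gap, yet three of the four rows are lost, which shows why the per-column counts alone understate the effect of the filter.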

### Join all datasets:



In order to relate the different variables of each dataset we have to link the different datasets so that it is easier to manipulate them. Recall that so far we have the following datasets available:
- products.csv
- districts.csv
- engagement.csv
- poverty2.csv


In [None]:
merged_data = pd.merge(products, eng, left_on = 'LP ID', right_on = 'lp_id')
merged_data['district_id'] = merged_data['district_id'].astype('int64')
df_total = pd.merge(merged_data, dist, on = 'district_id')

In [None]:
df_total = pd.merge(df_total, poverty2,on = 'state' )
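Inner merges silently discard rows whose key has no match, so it can be worth checking how many records survive each join. The sketch below is one possible check under stated assumptions: the three small tables are hypothetical stand-ins for `products`, `eng` and `dist`, while `validate` and `indicator` are standard `pd.merge` options.

```python
import pandas as pd

# Hypothetical stand-ins for products, eng and dist.
toy_products = pd.DataFrame({"LP ID": [10, 20], "Product Name": ["A", "B"]})
toy_eng = pd.DataFrame({
    "lp_id": [10.0, 10.0, 30.0],
    "district_id": ["1000", "2000", "1000"],
    "pct_access": [0.5, 0.2, 0.1],
})
toy_dist = pd.DataFrame({"district_id": [1000], "state": ["Utah"]})

# validate raises an error if a product id is duplicated on the products side.
merged = pd.merge(toy_products, toy_eng, left_on="LP ID", right_on="lp_id",
                  validate="one_to_many")
merged["district_id"] = merged["district_id"].astype("int64")
total = pd.merge(merged, toy_dist, on="district_id", validate="many_to_one")

# indicator=True adds a _merge column that flags unmatched engagement records.
check = pd.merge(toy_eng, toy_products, left_on="lp_id", right_on="LP ID",
                 how="left", indicator=True)
unmatched = int((check["_merge"] == "left_only").sum())

print(len(toy_eng), len(merged), len(total), unmatched)
```

Comparing the three lengths shows at which join records are lost: here one record has an unknown product and another belongs to a district without demographic information.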

## Exploratory analysis:

In this dataset, the sponsors of the competition have chosen 174 schools out of the 64,784 schools in the states considered. In other words, the sample to be studied is 0.27% of the total. It should also be noted that not all states are represented. A total of 23 states have been selected. In the following, different characteristics of the selected schools and states will be shown:


In [None]:
print('Total schools: {}'.format(poverty2['#escuelas'].sum()))
print('Sample schools: {}'.format(dist['state'].value_counts().sum()))
print('% of total schools: {:.2f}'.format((174/64784)*100))

In the following, the location of the chosen schools and the number of schools in each state of the above-mentioned sample will be shown.

In [None]:
dist['state_abbrev'] = dist['state'].replace(us_state_abbrev)

Geographical map of the distribution of the states and the number of schools in it (with the sample given by the sponsors):

In [None]:
school_state = dist['state_abbrev'].value_counts().to_frame().reset_index(drop=False)
school_state.columns = ['state_abbrev', 'num_districts']

fig = go.Figure()
layout = dict(
    title_text = "Number of Available School Districts per State",
    geo_scope='usa',
)

fig.add_trace(
    go.Choropleth(
        locations=school_state.state_abbrev,
        zmax=1,
        z = school_state.num_districts,
        locationmode = 'USA-states', # set of locations match entries in `locations`
        marker_line_color='white',
        geo='geo',
        colorscale=px.colors.sequential.Teal, 
    )
)
            
fig.update_layout(layout)   
fig.show()

This map shows that Connecticut has 30 schools, followed by Utah with 29. The criteria for the selection of schools I leave to the sponsor. In my personal opinion, a more uniform sample would have made the results more comparable, but this is left for future exploration. I leave the link to the total number of schools for this future research [1].
The states can be classified in different ways: by location, by the percentage of Black/Hispanic students (pct_black/hispanic), and by the percentage of students eligible for free or reduced-price lunch (pct_free/reduced).
A locale classification is a general geographic indicator that describes the type of area where a school is located. NCES classifies [2] all territory in the U.S. into four types: Rural, Town, Suburban, and City.

Continuing the thread of the document, states can also be classified according to the party of their governor: Republican (R) or Democrat (D).



In [None]:
# Temporary Dataframe for Checking the Distribution of Locale in every State
temp = pd.crosstab(df_total.state, df_total.locale)
temp["summation"] = temp.sum(axis=1)
temp["city_percent"] = temp.City*100/temp.summation
temp["rural_percent"] = temp.Rural*100/temp.summation
temp["suburb_percent"] = temp.Suburb*100/temp.summation
temp["town_percent"] = temp.Town*100/temp.summation

# State and locale Distribution Plot
fig = go.Figure()
fig.add_trace(go.Bar(x=temp.index, y=temp.city_percent, name="Percentage City",
                    marker_color=px.colors.qualitative.Antique[10]))
fig.add_trace(go.Bar(x=temp.index, y=temp.rural_percent, name="Percentage Rural",
                    marker_color=px.colors.qualitative.Set2[7]))
fig.add_trace(go.Bar(x=temp.index, y=temp.suburb_percent, name="Percentage Suburb",
                    marker_color=px.colors.qualitative.Dark2[5]))
fig.add_trace(go.Bar(x=temp.index, y=temp.town_percent, name="Percentage Town",
                    marker_color=px.colors.qualitative.Vivid[0]))

# fig.update_traces(texttemplate='%{text:.2s}')
fig.update_xaxes(tickfont_size=16, tickangle=270)
fig.update_layout(font_family='Arial', 
                  title=dict(text="<b>Locale Distribution per State", 
                             font_size=20, x=0.5),
                  barmode='stack', height=600,
                  legend=dict(orientation="h",
                              yanchor="bottom",
                              y=1.02,xanchor="right",
                              x=1
                ))
fig.show()
del temp
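The row percentages computed by hand above can also be obtained directly from `pd.crosstab` with `normalize="index"`. A minimal sketch on a made-up table (the states and locales below are illustrative, not the real distribution):

```python
import pandas as pd

# Made-up rows; each row stands for one record of df_total.
toy = pd.DataFrame({
    "state": ["Utah", "Utah", "Utah", "Utah", "Ohio", "Ohio"],
    "locale": ["City", "Suburb", "Suburb", "Suburb", "Rural", "Town"],
})

# normalize="index" divides every row by its total, giving shares per state.
locale_pct = pd.crosstab(toy["state"], toy["locale"], normalize="index") * 100

print(locale_pct.round(1))
```

Because `df_total` holds one row per engagement record, percentages computed on it count records rather than districts; running the same crosstab on `dist` would count districts instead.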

We can observe that the vast majority of schools are located in suburbs. However, in states such as Arizona, California, Texas and Washington, schools are mostly located in cities. It is also worth noting that schools located in towns are barely represented. I leave it to the discretion of the sponsor. 

In [None]:
#fig, ax = plt.figure(1,2,1)
plot = poverty2['type_state'].value_counts().plot(colors = ("#0d0d8c","#e80219"),
                                                      kind = 'pie',
                                                      autopct='%1.1f%%',
                                                      figsize=(8, 8),
                                                      startangle=20).legend(loc="upper right");

plt.title('Distribution of government states', size=15)
plt.axis('off')
plt.show()
#fig, ax = plt.figure(1,2,2)
plot = poverty2['Cardinal_points'].value_counts().plot(colors = ("#3fab29","#5bab4b",'#76ad6c','#95b58f'),
                                                      kind = 'pie',
                                                      autopct='%1.1f%%',
                                                      figsize=(8, 8),
                                                      startangle=20).legend(loc="upper right");

plt.title('Distribution of states cardinal points', size=15)
plt.axis('off')
plt.show()

It can be seen that 54.5% of the states in the sample are Democratic, so the sample is roughly evenly distributed in this respect. Another aspect to take into account is the distribution of states according to cardinal points. Is there homogeneity? No: 36.4% of the chosen states are located in the Midwest. Therefore, there are three conclusions for this sample:
1. the sample is not homogeneous in number of schools.
2. the sample is not homogeneous with respect to its distribution of the cardinal points of the USA.
3. the sample is homogeneous in the type of governor.

Once we have a complete map of the overall choice of our sample, we will analyse the essential products, sectors and functionalities used for children to follow their learning online. It is essential to go into this area with the idea that international companies such as Google and Microsoft and services such as Zoom are likely to be at the top of the list. We will see below, if our hypothesis is true:

**Which 10 technologies are the most widely used?**

Students who have been able to continue their studies online have used a multitude of products to further their studies and finish the school year. These products have been provided by large multinationals such as Google, Microsoft and Amazon, as well as by smaller companies, and all of them have contributed effectively to helping students continue their education. To see the most used products, the graph below shows the top 20:

In [None]:
#Plotting a Bar Graph to know which is the product that is used by most of the students

plt.figure(figsize=(15,7))
most_used_product= sns.countplot(x = "Product Name",data= df_total, 
              order=df_total["Product Name"].value_counts().index[:20],palette = 'Spectral')
for p in most_used_product.patches:
    most_used_product.annotate(f'\n{p.get_height()}', (p.get_x()+0.4, p.get_height()), 
                               ha='center' ,va='center', color='black', size=11)
plt.title("Top 20 Product Name", size=15)
plt.xticks(rotation=90)
plt.show()

Google Docs, Google Drive and Google Classroom stand out from the others. These are products offered by the company 'Google'. Most students have relied on Google products to continue their online education without interruption. Also in the top 10 are Khan Academy, which was kind enough to offer the platform for free school courses, and Wikipedia. It is also amusing to see that our students have taken advantage of this confinement to catch up on films and series (Netflix). This is an indicator of how lifestyles changed during the confinement. Netflix provided a section where you could watch a movie with your friends at the same time (as if you were at the cinema or had a date with them) to foster a group feeling. 

Next, we will look at which companies are prominent in their product offerings. Given the success of its products among our students, `Google` will surely be among the top positions.

In [None]:
plt.figure(figsize=(10,8))
company= sns.countplot(x = "Provider/Company Name",data= df_total, hue = 'type_state',
                  order=df_total["Provider/Company Name"].value_counts(normalize=True).index[:5],
                       orient='h',palette ='Spectral')
for p in company.patches:
    company.annotate(f'\n{p.get_height()}', (p.get_x()+0.4, p.get_height()), 
                                                      ha='center',
                                                      va='center', 
                                                      color='black', size=14)
plt.title("Top 5 Company Name", size=15)
plt.xticks(rotation=90)
plt.show()

The initial hypothesis is confirmed. Google is the company most used by our students, followed by Microsoft. **Is there a difference between Republicans and Democrats?** As can be seen, there is a curious peculiarity: Democratic states use these companies' products more than Republican ones. However, this conclusion is weakened by the fact that the chosen sample contains more Democratic states than Republican ones. 
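One way to account for the different group sizes is to compare shares within each group instead of raw counts. The sketch below uses made-up records and shortened company names, so it only illustrates the idea and says nothing about the real result.

```python
import pandas as pd

# Made-up records: twice as many come from Democratic (D) states as from Republican (R) ones.
toy = pd.DataFrame({
    "Provider/Company Name": ["Google"] * 6 + ["Microsoft"] * 3,
    "type_state": ["D", "D", "D", "D", "R", "R", "D", "D", "R"],
})

# Raw counts: D looks larger for every company simply because D has more records.
raw = pd.crosstab(toy["Provider/Company Name"], toy["type_state"])

# Shares within each type of state: every column sums to 1.
share = pd.crosstab(toy["Provider/Company Name"], toy["type_state"],
                    normalize="columns")

print(raw)
print(share.round(3))
```

In this toy example the raw counts differ by a factor of two, while the shares are identical in both groups, which is the situation the caveat above warns about.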
These companies provide products for different sectors of education, that is, the areas where each product is used. There are three sectors:
1. PreK-12
2. Higher Ed
3. Corporate

Or a combination of them. This means that a company can provide a product that serves all 3 sectors of education. An example is `Amazon.com, Inc.` An interesting question would be: in which sector of education have companies put the most effort: PreK-12, Higher Ed or Corporate?

In [None]:
import re

In [None]:
# Encode the sectors from df_total itself so that the rows line up with the join below
temp_sectors = df_total['Sector(s)'].str.get_dummies(sep="; ")
temp_sectors.columns = [f"sector_{re.sub(' ', '', c)}" for c in temp_sectors.columns]
products_info = df_total.join(temp_sectors)
products_info.drop("Sector(s)", axis=1, inplace=True)
del temp_sectors

#After One-Hot Encoding
products_info.head(2)
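To make the encoding step easier to follow, here is a small self-contained sketch of `str.get_dummies` on a multi-label column. The three products are made up; only the sector names come from the dataset description.

```python
import pandas as pd

# Made-up products whose Sector(s) field holds one or several labels.
toy = pd.DataFrame({
    "Product Name": ["A", "B", "C"],
    "Sector(s)": ["PreK-12", "PreK-12; Higher Ed", "PreK-12; Higher Ed; Corporate"],
})

# One indicator column per label found between the "; " separators.
dummies = toy["Sector(s)"].str.get_dummies(sep="; ")
dummies.columns = ["sector_" + c.replace(" ", "") for c in dummies.columns]

# Both frames share the same index, so join aligns them row by row.
encoded = toy.join(dummies)

print(encoded)
print(dummies.sum())
```

Summing the indicator columns counts how many products serve each sector, whatever combination they appear in.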

In [None]:
plot = df_total['Sector(s)'].value_counts().plot(colors = ('#0ced53',"#3fab29","#5bab4b",'#76ad6c','#e3eb05'),
                                                      kind = 'pie',
                                                      autopct='%1.1f%%',
                                                      figsize=(12, 12),
                                                      startangle=20).legend(loc="upper right");

plt.title('Distribution of Sectors', size=15)
plt.axis('off')
plt.show()

We see the different combinations of the three education sectors. Most of the companies aim their products at PreK-12 alone (40.7%) or at the combination of the three sectors (40.6%). This means that, in general terms, companies have not favoured one sector over the others. 
If we recapitulate, we know that:
1. the company that stands out above the others in offering its products for the online student is `Google` followed by `Microsoft`. 
2. the companies provide their services without much distinction to the three sectors of education.

The basic functions of each product will be discussed below. There are two layers of labels here. Products are first labeled as one of these three categories:
1. LC = Learning & Curriculum
2. CM = Classroom Management 
3. SDO = School & District Operations

Each of these categories has multiple sub-categories with which the products were labelled. In the following we will first analyse the combination of categories and sub-categories and then look at each one separately. 
Categories:

In [None]:
#Plotting a bar graph to see which primary essential functions are used most by the students
plt.figure(figsize=(10,8))
most_used_primary_essential_function= sns.countplot(x = "Primary Essential Function",data= df_total, 
              order=(df_total["Primary Essential Function"].value_counts()/len(df_total)).index[:10],
                                                    palette = 'Spectral')
for p in most_used_primary_essential_function.patches:
    most_used_primary_essential_function.annotate(f'\n{p.get_height()}', (p.get_x()+0.4, p.get_height()), 
                                                  ha='center',
                                                  va='center', 
                                                  color='black', size=14)
plt.title("Top 10 Primary Essential Function", size=15)
plt.xticks(rotation=90)
plt.show()

According to this graph, the first four positions all belong to the LC category. The most frequent primary essential function is `LC - Digital Learning Platforms`, followed by `Sites, Resources & Reference`. This makes sense, as online study platforms became the main tool for students to continue their studies. In fact, all of the first 10 primary essential functions fall within the `LC` category, so we can expect Learning & Curriculum to be the category most used by our students. It would be interesting to split this category into Learning and Curriculum separately: because of the pandemic, companies that used to host interns stopped doing so, so there was little job or internship searching during that period.

The following graph shows the representation of the different categories:

In [None]:
fig, ax = plt.subplots(1,1, figsize=(14,7))


#plt.figure(figsize=(10,8))
most_used_primary_essential_function= sns.countplot(x = "pef_cat",data= df_total, 
              order=df_total["pef_cat"].value_counts().index[:10])

for p in most_used_primary_essential_function.patches:
    most_used_primary_essential_function.annotate(f'\n{p.get_height()}', (p.get_x()+0.4, p.get_height()), 
                                                  ha='center',
                                                  va='center', 
                                                  color='white', size=15)
plt.title("Principal Categories", size=15)
plt.xticks(rotation=90)
plt.show()

temp = df_total['pef_cat'].value_counts().reset_index()
temp.columns = ['pef_cat', 'percent']
temp['percent'] /= len(df_total)

fig = px.pie(temp, names='pef_cat', values='percent',
    color_discrete_sequence=px.colors.qualitative.D3,
    width=700,
    height=500,
)
fig.update_layout(title=dict(text="<b>Principal Categories Distribution", x=0.5, font_size=15))
fig.show()
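
The percentages shown in the pie chart can also be read off as plain numbers. The following is a minimal sketch on a toy dataframe (the `toy` frame and its five rows are illustrative assumptions, not the real `df_total`), using `value_counts(normalize=True)` in place of the manual division by `len(df_total)`:

```python
import pandas as pd

# Toy stand-in for df_total (assumption): one row per engagement record.
toy = pd.DataFrame({"pef_cat": ["LC", "LC", "LC", "CM", "SDO"]})

# Share of rows per main category, in percent, most frequent first.
shares = toy["pef_cat"].value_counts(normalize=True).mul(100).round(1)
print(shares)
```

On the real data, the same call on `df_total["pef_cat"]` should give the shares that the pie chart displays.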

Our hypothesis is confirmed: these two graphs show that the `LC` category is the most used (79.6%). Now it is interesting to look at the sub-categories and their distributions.

In [None]:
temp = df_total['pef'].value_counts().reset_index()
temp.columns = ['pef', 'percent']
temp['percent'] /= len(df_total)

fig = px.pie(temp, names='pef', values='percent',
    color_discrete_sequence=px.colors.qualitative.D3,
    width=1000,
    height=500,
)
fig.update_layout(title=dict(text="<b>Sub-category wise Products Distribution", x=0.5, font_size=15))
fig.show()

29.6% of students use products from the sub-category `Sites, Resources & Reference`, followed by 21.1% for `Digital Learning Platforms`. This means that the products students use are mostly for studying: digital platforms based on free software or cloud technology that let them store and search their references conveniently, regardless of the memory and hardware of their device.

It is interesting to note that only 2.08% used `Virtual Classroom`. This would indicate that only 2.08% had the option of live-streamed classes between teacher and student. Those students, whose classes simulated the face-to-face mode (i.e., the subject could be taught every day as in traditional schools), followed a more complete and dynamic curriculum than the rest. For the others, it implies a lack of daily monitoring by teachers and therefore an inequality in the content taught.

Next, we can look at the products that make up `Virtual Classroom`; the ones that stand out are `Meet` and `Zoom`. To visualise them, the variable `pct_access` (the % of students in the district with at least one page-load event of a given product on a given day) was averaged. This gets a bit ahead of the plan, which was first to take a snapshot of each variable and then to see how they evolve over time. To make the graph easier to read, weekends have been removed, as there are no streamed classes on those days. At the beginning, the average pct_access of Meet did not reach 11%. In other words, at most about 11% of students in the district had at least one page-load event of Zoom or Meet on a given day. In July and August there is practically no activity, until the autumn semester of the 2020-2021 school year starts.

Because of the many outbreaks in different states caused by international and domestic travel, classrooms in all schools were again ordered to close in order to avoid contagion. By then schools were better prepared to offer their curriculum online, and households had had more time to obtain a device so that students could continue their studies. That is why pct_access rises to 18% during the autumn semester.
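
The weekday-only averaging described above is not shown in the notebook. A minimal sketch of how it could be done follows. The toy rows and their values are illustrative assumptions; only the column names `time`, `Product Name` and `pct_access` come from the dataset.

```python
import pandas as pd

# Toy stand-in for the Virtual Classroom rows of df_total (assumed values).
toy = pd.DataFrame({
    "time": ["2020-03-06", "2020-03-07", "2020-03-09", "2020-03-09", "2020-03-10"],
    "Product Name": ["Meet", "Meet", "Meet", "Zoom", "Zoom"],
    "pct_access": [10.0, 1.0, 12.0, 4.0, 6.0],
})
toy["time"] = pd.to_datetime(toy["time"])

# dayofweek: Monday=0 ... Sunday=6, so values below 5 are weekdays.
weekdays = toy[toy["time"].dt.dayofweek < 5]

# Average pct_access per product and day, ready for a line plot over time.
daily_mean = (
    weekdays.groupby(["Product Name", "time"])["pct_access"].mean().reset_index()
)
print(daily_mean)
```

Dropping weekends before averaging keeps the near-zero Saturday and Sunday values from pulling the weekly level down.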

Below are two graphs providing a visual 'summary' of what has been explained so far.

In [None]:
temp = df_total.groupby(["Sector(s)", "pef_cat"])["URL"].count().reset_index()
temp.columns = ["Sector", "pef_cat", "Counts"]

# Sunburst: distribution of main categories within each sector
fig = px.sunburst(data_frame=temp, path=["Sector", "pef_cat"], values="Counts", 
                 color_discrete_sequence=px.colors.qualitative.D3)
fig.update_layout(font_family="Arial",
                  title=dict(text="<b>Distribution of Categories among various Sectors", font_size=20, x=0.5),
                  font_size=16)
fig.show()
del temp
temp = df_total.groupby(["pef_cat", "pef"])["URL"].count().reset_index()
temp.columns = ["pef_cat", "pef", "Counts"]
# Sunburst: distribution of sub-categories within each main category
fig = px.sunburst(data_frame=temp, path=["pef_cat", "pef"], values="Counts", 
                 color_discrete_sequence=px.colors.qualitative.D3)
fig.update_layout(font_family="Arial",
                  title=dict(text="<b>Distribution of Sub Categories among various Categories",
                             font_size=20),
                  font_size=16)
fig.show()
del temp

With these two interactive graphs we see the proportions of each main category with respect to its sectors and each sub-category with respect to its main category.

With all this in mind, we will change course and analyse all of the above but incorporate new variables:

1. pct_access: % of students in the district who have at least one page-load event of a given product on a given day.
2. engagement_index: total page-load events per 1000 students of a given product on a given day.
3. time: the date, in the format YYYY-MM-DD.

First we will look at how `pct_access` and `engagement_index` vary across the different states:

In [None]:
grouped_districts = df_total.groupby(by=["state"])[['pct_access','engagement_index']].mean()

df_total["state_code"] = df_total.state.map(us_state_abbrev)
grouped_districts = grouped_districts.reset_index()
grouped_districts["state_abv"] = grouped_districts.state.map(us_state_abbrev)

def map_plot(dataframe, location, color, hover):
    fig = px.choropleth(data_frame=dataframe, locations=location, locationmode="USA-states",
                    color=color, scope="usa", hover_name=hover, color_continuous_scale="viridis_r")
    fig.update_layout(font_family="Arial",)
    return fig

fig = map_plot(grouped_districts, "state_abv", "pct_access", "state")
fig.update_layout(title=dict(text="<b>Comparison of State's pct access levels", font_size=20))
fig.show()

fig = map_plot(grouped_districts, "state_abv", "engagement_index", "state")
fig.update_layout(title=dict(text="<b>Comparison of State's engagement index levels", font_size=20))
fig.show()

The upper map shows the average pct_access per state and the lower map the average engagement_index.
Regarding the average pct_access:
- North Dakota (3.77) and Arizona (2.85) have the highest means; North Carolina (0.49) has the lowest.

Regarding the average engagement_index:
- Arizona (858.62) and North Dakota (503.88) have the highest means; Tennessee (117.33) has the lowest.
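
The extremes quoted above can be extracted programmatically rather than read off the map. This is a minimal sketch that uses only the three pct_access means quoted in the text. In the notebook the same calls would apply to the `grouped_districts` dataframe indexed by state, which is an assumption about how one would wire it in.

```python
import pandas as pd

# Mean pct_access per state, limited to the three figures quoted in the text.
state_means = pd.Series(
    {"North Dakota": 3.77, "Arizona": 2.85, "North Carolina": 0.49},
    name="pct_access",
)

highest = state_means.idxmax()                     # state with the largest mean
lowest = state_means.idxmin()                      # state with the smallest mean
top_two = state_means.nlargest(2).index.tolist()   # the two leading states
print(highest, lowest, top_two)
```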

**Could there be a correlation between the two variables?**

In [None]:
# Correlation between various Indexes
temp = df_total[['pct_access','engagement_index']]
plt.figure(figsize=(10,8))
sns.heatmap(temp.corr(), annot=True, cmap="viridis_r")
plt.xticks(size=12, rotation=0)
plt.yticks(size=12, rotation=0)
plt.title("Correlation between pct access & engagement index variables", size=20)
plt.show()
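
The heatmap reports the Pearson coefficient, which is the pandas default. As a complementary check, a rank-based (Spearman) coefficient can be obtained by correlating the ranks. The sketch below uses toy values (an assumption, not the real data) in which engagement grows faster than linearly with access:

```python
import pandas as pd

# Toy values (assumption): engagement rises monotonically but not linearly.
toy = pd.DataFrame({
    "pct_access": [0.1, 0.5, 1.0, 2.0, 4.0],
    "engagement_index": [5.0, 40.0, 90.0, 260.0, 900.0],
})

# Pearson measures linear association.
pearson = toy.corr().loc["pct_access", "engagement_index"]

# Pearson on the ranks equals the Spearman coefficient (monotonic association).
spearman = toy.rank().corr().loc["pct_access", "engagement_index"]
print(round(pearson, 3), round(spearman, 3))
```

If the two coefficients differ noticeably on the real data, the relationship is monotonic but not strictly linear.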

**Are there more correlations between variables that we have not yet tested?** As the heatmap shows, there is a strong positive correlation between the two variables. This result is logical given the definitions of the two variables.

To finish this analysis, let's look at the average behaviour of the two variables during a standard week. **On weekends, do students keep studying as they do during the rest of the week, in order to reinforce the knowledge acquired?**

In [None]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

def bar_plot(dataframe, feature1, feature2):
    if len(feature2)>1:
        fig = make_subplots(rows=1, cols=len(feature2), subplot_titles=feature2)
        for i in range(len(feature2)):
            fig.add_trace(go.Bar(x=dataframe[feature1[0]], y=dataframe[feature2[i]],
                                 name=feature2[i]), row=1, col=i+1)
        fig.update_layout(font_family="Arial", showlegend=False, margin=dict(l=0, r=0, t=100, b=50), 
                          height=400)
        fig.update_traces(marker_color='rgb(158,202,225)', marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5, opacity=0.6)
        return fig
# ISO week number; Series.dt.weekofyear was removed in pandas 2.0
eng["week"] = pd.to_datetime(eng.time).dt.isocalendar().week.astype(int)

# extracting day of week from time (Monday=0, Sunday=6)
eng["day_of_week"] = pd.to_datetime(eng.time).dt.dayofweek
# analysing the effect of the day of the week on engagement_index and pct_access
temp = eng.groupby("day_of_week")[["engagement_index","pct_access"]].mean().reset_index()

# visualizing the effect of the day of the week on engagement_index and pct_access
fig = bar_plot(temp, ["day_of_week"], ["engagement_index","pct_access"])
fig.update_layout(title=dict(text="<b>Day of the Week effect</b>", x=0.5, font_size=20))
fig.show()

The answer to the question is no. Students do not spend as much time at the weekend downloading materials to study: both variables drop by about two thirds. We should also note that the two variables behave very similarly throughout the week.
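
The size of the weekend drop can be quantified directly instead of being estimated from the bars. A minimal sketch follows, on a toy week whose values are assumptions chosen to reproduce a two-thirds drop:

```python
import pandas as pd

# One toy week, Monday 2020-03-02 to Sunday 2020-03-08 (assumed values).
toy = pd.DataFrame({
    "time": pd.date_range("2020-03-02", periods=7, freq="D"),
    "engagement_index": [300.0] * 5 + [100.0] * 2,
})

# Label each day: dayofweek 5 and 6 are Saturday and Sunday.
toy["day_type"] = toy["time"].dt.dayofweek.ge(5).map(
    {True: "weekend", False: "weekday"}
)

means = toy.groupby("day_type")["engagement_index"].mean()

# Relative drop from the weekday mean to the weekend mean.
relative_drop = 1 - means["weekend"] / means["weekday"]
print(round(relative_drop, 3))
```

The same calculation on `eng` would give the exact drop for each of the two variables.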

Now we are going to look at the `temporal dependence` with the different variables, to see their evolution throughout the year 2020. These variables are:

**1. State**

In [None]:
def time_plot(df, factor, label):
    for i in ["pct_access","engagement_index"]:
        fig = px.line(data_frame=df, x="week", y=i, color=factor)
        fig.update_layout(font_family="Arial",title=dict(text="<b>"+i+" - "+factor+" wise - "+label,  
                                                                 x=0.5, font_size=20), 
                          height=300, margin=dict(b=0))
        fig.update_xaxes(title=None)
        fig.update_yaxes(title=None)
        fig.show()
# ISO week number; Series.dt.weekofyear was removed in pandas 2.0
df_total["week"] = pd.to_datetime(df_total.time).dt.isocalendar().week.astype(int)
# State-wise effect of Covid on page-load events - best performing states
temp = df_total.groupby(["state","week"])[["pct_access","engagement_index"]].mean().reset_index()
#temp = temp[~temp.state.isin(["Minnesota","North Dakota"])]           # data is not complete for these states

top_10_states = temp.groupby(["state"])["engagement_index"].sum().reset_index().sort_values("engagement_index",ascending=False).head(10).state.values

temp = temp[temp.state.isin(top_10_states)]

time_plot(temp, "state", "Top 10")

In the graphs, Arizona has the highest values of pct_access and engagement_index, so it is logical that it keeps the highest values over the weeks. Three behaviours are common to all the states shown:
- Weeks 0-23: spring semester of the 2019-2020 academic year.
- Weeks 23-30: summer holidays.
- Weeks 30-52: autumn semester of the 2020-2021 academic year.

During the holidays, it is normal for internet access for study purposes to decrease.
State and local holidays must also be taken into account. In Arizona, for example, there is a significant drop in both variables in weeks 12-13, and again in weeks 41-42.
Another point to note is that students in the state of New York started to connect later than students in Arizona, but from week 12 onwards the variables follow the same behaviour.
Looking at the political orientation of the states in the two graphs, we see:
- Republican: Arizona, Indiana, Massachusetts and Ohio.
- Democratic: Illinois, Wisconsin, Connecticut, New York, New Jersey and the District of Columbia.

Thus 40% of the top 10 are Republican and 60% are Democratic.
These variations are interesting: although Donald Trump applied secular policies to deal with the Covid-19 pandemic, one might think that Arizona's policies anticipated the possible effects on the education sector. In fact, the state appears to have been accustomed to using the internet for study already, since its values in the first weeks of 2020 stand out from those of the other states.
The behaviour that responds most clearly to the events of the pandemic is that of New York, a Democratic state that declared tough statewide policies under the Tenth Amendment of the Constitution, closing schools and public places from March 7th.
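
The three periods can be attached to the data as a label, which makes it possible to compare them numerically. This is a minimal sketch with three illustrative dates. The bin edges follow the week ranges given above, and treating the boundary weeks 23 and 30 as the end of the earlier period is an assumption.

```python
import pandas as pd

# Three illustrative dates, one per period (assumption).
dates = pd.to_datetime(pd.Series(["2020-02-10", "2020-07-01", "2020-10-15"]))

# ISO week number of each date.
weeks = dates.dt.isocalendar().week.astype(int)

# Right-inclusive bins: (0, 23], (23, 30], (30, 53].
period = pd.cut(
    weeks,
    bins=[0, 23, 30, 53],
    labels=["spring semester", "summer holidays", "autumn semester"],
)
print(pd.DataFrame({"week": weeks, "period": period}))
```

Grouping `df_total` by such a `period` column would give one mean of pct_access and engagement_index per period and state.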

First we will see how the variables evolve over time and then we will add the variables described below:

In [None]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

temp = eng.groupby("time")[["engagement_index","pct_access"]].mean().reset_index()
# Plot showing overall engagement_index and pct_access
fig = make_subplots(rows=2, cols=1, shared_xaxes=True, vertical_spacing=0.1, 
                    subplot_titles=["<b>% At Least 1 Page-Load Event <br> ", 
                                    "<b>Page Loads per 1000 Students <br> "])

fig.add_trace(go.Scatter(x=temp.time, y=temp.pct_access,marker=dict( color='rgba(055, 056, 150, 0.5)',size=20)), 
              row=1, col=1)
fig.add_trace(go.Scatter(x=temp.time, y=temp.engagement_index,marker=dict( color='rgba(055, 000, 050, 0.5)',size=20)),
              row=2, col=1)

fig.add_vrect(x0="2020-05-20", x1="2020-08-30",
              fillcolor="red", opacity=0.2, line_width=0)

fig.add_vrect(x0="2020-03-19", x1="2020-04-07",
              fillcolor="red", opacity=0.2, line_width=0)

fig.add_vline(x="2020-01-21", line_dash="dot")
fig.add_vline(x="2020-02-03", line_dash="dot")
fig.add_vline(x="2020-07-16", line_dash="dot")


fig.add_annotation(x="2020-01-21", y=400, text="First Case in USA<br>dated 2020-01-21", 
                   ax=-40, ay=-50, row=2, col=1, arrowsize=2, arrowhead=2)

fig.add_annotation(x="2020-06-25", y=1, text="Summer Break", row=1, col=1, showarrow=False, font_color="green")
fig.add_annotation(x="2020-06-25", y=350, text="Summer Break", row=2, col=1, showarrow=False, font_color="green")
fig.add_annotation(x="2020-03-30", y=480, row=2, col=1, showarrow=False, font_color="red",
                   text="Start of Lockdowns<br>in USA <br>from 2020-03-19 to 2020-04-07")
fig.add_annotation(x="2020-02-03", y=1.35, text="Public Health Emergency <br>Declared in USA<br> dated 2020-02-03", 
                   ax=80, ay=-55, row=1, col=1, arrowsize=2, arrowhead=2)
fig.add_annotation(x="2020-07-16", y=1.5, text="New Record of Daily Cases - 76,000<br>in USA dated 2020-07-16", 
                   ax=40, ay=-40, row=1, col=1, arrowsize=2, arrowhead=2)

fig.update_layout(font_family="Arial", 
                  showlegend=False, margin=dict(l=0, r=0, t=50), height=800)
fig.show()
del temp

**2. Locale:**

In [None]:
temp = df_total.groupby(["locale","time"])[["pct_access","engagement_index"]].mean().reset_index()

# pct_access and engagement_index over time among different locale
for i in ["pct_access","engagement_index"]:
    fig = px.line(data_frame=temp, x="time", y=i, color="locale")
    fig.update_layout(font_family="Arial",title=dict(text="<b>"+i+" over time among Locale", x=0.5, font_size=20),
                      height=400, 
                     legend=dict(orientation="h",yanchor="bottom", y=0.95,xanchor="right", x=1))
    fig.update_xaxes(title=None)
    fig.update_yaxes(title=None)
    fig.show()
del temp

This graph shows that cities have the highest values of both variables. There are several possible reasons:
- Cities have better broadband coverage, so more documents are downloaded.
- City residents are more used to having a device connected to the internet as part of modern life.

However, this result contrasts with the fact that cities have the highest poverty rates.

**3. Category:**

In [None]:
temp = df_total.groupby(["pef_cat","week"])[["pct_access","engagement_index"]].mean().reset_index()

# time plot for pct_access and engagement_index by main category
time_plot(temp, "pef_cat","All")

The two variables follow the same pattern. The dominant main category is SDO. Curiously, the previous graphs showed that most of the products students use fall within the LC category, yet the SDO category has more page loads.

Now let's look at the main function that students have accessed, `LC - Digital Learning Platforms`:

In [None]:
agg_digi_learn_df = df_total[df_total["Primary Essential Function"] == 'LC - Digital Learning Platforms']
agg_engagement_data = agg_digi_learn_df.groupby(["state", "time"], as_index=False)["engagement_index"].sum()
def set_size(value):
    '''
    Takes the numeric value of a parameter to visualize on a map (Plotly Geo-Scatter plot).
    Returns a number indicating the size of the bubble for the state whose numeric
    attribute value was supplied as an input.
    '''
    result = np.log(1+value/100)
    if result < 0:
        result = 0.001
    return result

pipeline = pdp.PdPipeline([
    pdp.ApplyByCols('engagement_index', set_size, 'size', drop=False),
    pdp.MapColVals('state', us_state_abbrev)
])

agg_engagement_data = pipeline.apply(agg_engagement_data)
agg_engagement_data = agg_engagement_data.sort_values(by='time', ascending=True)

# Visualization:
fig = px.scatter_geo(
    agg_engagement_data, locations="state", locationmode='USA-states',
    scope="usa",
    color="engagement_index", 
    size='size', hover_name="state", 
    range_color= [0, 200000], 
    projection="albers usa", animation_frame="time", 
    title='Engagement Index: LC - Digital Learning Platforms', 
    color_continuous_scale=px.colors.sequential.Greys)

fig.show()

This graph confirms the three distinct behaviours described throughout the document. The summer months are the period of least activity: students were not studying and, once the confinement had been lifted, there was more life in the streets, with trips and evening outings for teenagers and, for children, visits to the parks and gardens that had been closed.

In [None]:
temp = df_total.groupby(["Product Name","week"])[["pct_access","engagement_index"]].mean().reset_index()

# time plot for pct_access and engagement_index by product
time_plot(temp, "Product Name","All")

This graph confirms, now with the time dimension included, what was said in **graph number**: Google Docs and Google Classroom are the two products most used by students across all US states.

**4. Sector(s):**

In [None]:
temp = df_total.groupby(["Sector(s)","week"])[["pct_access","engagement_index"]].mean().reset_index()

# time plot for pct_access and engagement_index by sector
time_plot(temp, "Sector(s)","All")



This graph confirms, now over time, what was said earlier: `PreK-12; Higher Ed; Corporate` is the predominant sector that students accessed during the pandemic year.

**Final conclusions:** 
1. The variables pct_access and engagement_index have a positive correlation of 0.78.
2. Arizona is a Republican state that stands out for its high pct_access and engagement_index.
3. Arizona is a state where online study was already practised alongside traditional study, so Donald Trump's policies appear not to have influenced its education sector.
4. New York is a Democratic state where restrictive policies to tackle the pandemic were implemented on 7 March 2020. The graph shows that in the first months the two variables are very small, but from the beginning of March onwards they shoot up and New York becomes the leader of the list.
5. The three periods described throughout the document are confirmed. National and local holidays should also be taken into account when interpreting the variability of the variables.

With all this in mind, we will see how the different categorical variables of the `district` dataframe react over time. To do this, we group by all of these variables together, which improves the performance of the program:
- `state`
- `locale`
- `pct_black/hispanic`
- `county_connections_ratio`
- `pct_free/reduced`
- `time`

In [None]:
summary_df = df_total.groupby([
    "state",
    "locale",
    "pct_black/hispanic",
    "county_connections_ratio",
    "pct_free/reduced",
    "time"],
    as_index=False)["engagement_index"].sum().reset_index(drop=True)
summary_df = summary_df.sort_values(by='time', ascending=True)
summary_df.head()

In [None]:
fig = px.bar(summary_df,
             y="state",
             x="engagement_index",
             animation_frame="time",
             orientation='h',
             color="pct_free/reduced"
)
fig.update_layout(width=800,
                  height=600,
                  paper_bgcolor='rgba(0,0,0,0)',
                  plot_bgcolor='rgba(0,0,0,0)',
                  title_text='EI by State, pct_free/reduced and Date',
                  showlegend=True)
fig.update_xaxes(title_text='Engagement Index (EI)')
fig.show()

This graph gives an idea of the variation of the pct_free/reduced variable over time for each state. One fact stands out in the temporal evolution: at the beginning of the spring 2020 semester, each state showed a different pct_free/reduced, but for some the value was still pending or so small that it did not even appear in the graph. By the end of the period (e.g. 2020-11-09), all states have requested their subsidies and many are asking for a 100% reduction of their lunch fee. This suggests that poverty increased over this period.
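
To compare bands such as pct_free/reduced numerically, the interval labels first need to be turned into numbers. The sketch below assumes the labels are half-open interval strings in the style of the `[1, 2[` value that appears for `county_connections_ratio`. The helper `interval_midpoint` is hypothetical and not part of the notebook.

```python
import pandas as pd

def interval_midpoint(label):
    """Return the midpoint of a half-open interval label such as '[1, 2['.

    Hypothetical helper: missing labels are returned as NaN.
    """
    if pd.isna(label):
        return float("nan")
    # Remove the surrounding brackets, then split the two bounds.
    lower, upper = label.strip("[] ").split(",")
    return (float(lower) + float(upper)) / 2

# Illustrative labels (assumption), including one missing value.
labels = pd.Series(["[0, 0.2[", "[1, 2[", None])
midpoints = labels.map(interval_midpoint)
print(midpoints)
```

A numeric midpoint can then be used as a continuous colour scale or in a correlation with `engagement_index`.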

## Conclusions:

<div style="text-align: justify"> This analysis has uncovered several points that are relevant for this study and for future analyses.
Several analyses were carried out to complete the datasets as well as possible, using information from the links provided by the sponsors of the competition. There are different aspects to take into account. At the beginning of the pandemic, when Donald Trump was still playing down the situation, there was a considerable increase in the number of cases all over the USA. Cities with a high proportion of Indigenous, Black and Hispanic residents were more prone to the disease than those with a mostly white population. We attribute this to a purely economic factor: non-white residents tend to have lower incomes, as their jobs are often low-paid. This means that their shopping baskets tend to contain more fat and carbohydrates than quality protein. The virus affects the immune system, which in turn relies on the body's vitamins and minerals to build up its defences. We believe this is why the virus initially affected Democratic states more. This led the governments of these states to apply very restrictive laws and regulations, including school closures. Children could then not attend classes in person, and this affected their school performance. People of colour and Hispanics were affected again because classes were now held online, which meant that an internet connection and electronic devices were required for teaching. The lower classes were severely affected. In the USA, schools provide 'free/reduced' lunches for children. In the absence of this support, vulnerable families were pushed further away from a comfortable standard of living. Thus race, connectivity, parental income and the availability of electronic devices marked a large social divide that led to frustration and suffocation for many families. Seeing this situation, different states proposed an 'easier' system for children who had problems accessing school materials. 
This made the situation a little easier, but it is still too early to see to what extent this little push is effective for the education of our students. 

## Additional References:

[1] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7587838/

[2] https://nces.ed.gov/programs/edge/Geographic/LocaleBoundaries#:~:text=A%20locale%20classification%20is%20a,or%20proximity%20to%20populated%20areas.

[3] https://www.cdc.gov/mmwr/volumes/69/wr/mm6945a3.htm#_blank

[4] https://www.paradisosolutions.com/blog/impact-covid-19-education-untape-potential-e-learning/

[5] https://nces.ed.gov/programs/digest/2018menu_tables.asp

## Annexes:

### Annex I:

This annex shows the data from which the table `conec_2` is derived.


In [None]:
conec = pd.read_csv('../input/elisabeth-2/conec.csv')

In [None]:
conec.head()

In [None]:
conec = conec.drop(['Unnamed: 0'],axis=1)

In [None]:

conec.info()
# replace all -9999 by NaN
conec = conec.replace(-9999, np.nan)
# we delete the columns we are not interested in
conec.drop(['non_consumer','all'],axis=1,inplace=True)
# we delete all NaN values
conec = conec.dropna()
# rename column statename by state
conec = conec.rename(columns = {'statename':'state'})
# I invoke the district dataset to see the states with connections [1,2[.
dis[dis['county_connections_ratio'] == '[1, 2['] # North Dakota
a = dis['state'].unique()
a = pd.DataFrame(a, columns=['state'])
# I remove North Dakota from the list because I want connections up to 1.
aa = a[a['state']!='North Dakota']
aa = aa.dropna()
conec_ratio_unitario = conec[conec['ratio']<=1.0]
conec_ratio_unitario = conec_ratio_unitario[conec_ratio_unitario['state']!='North Dakota']
# Let's look at the results for all selected states.
for i in aa['state']:
    print('_'*15)
    print(i ,':')
    print('_'*15)
    print(conec_ratio_unitario[conec_ratio_unitario['state']==i]['ratio'].describe())
    sns.boxplot(x=conec_ratio_unitario[conec_ratio_unitario['state']==i]['ratio'])
    plt.show()
    print(':'*50)

We see that there are no values for the District of Columbia, so we will average all the connections and include the result in our dataframe. Now let's look at the case of North Dakota.
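
Before turning to North Dakota, the fill-in for the District of Columbia can be sketched as follows. The toy ratios are assumptions, and so is the choice of the overall mean of all rows (rather than the mean of the state means) as the fill value, since the text does not specify which average is used.

```python
import pandas as pd

# Toy connection ratios (assumption); the District of Columbia has no rows.
ratios = pd.DataFrame({
    "state": ["Arizona", "Arizona", "Ohio"],
    "ratio": [0.4, 0.8, 0.9],
})

# Mean ratio per state that does have data.
state_mean = ratios.groupby("state")["ratio"].mean()

# Fill the missing state with the mean over all available rows.
state_mean.loc["District of Columbia"] = ratios["ratio"].mean()
print(state_mean)
```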

In [None]:
print('North Dakota:' )
print(conec[(conec['state']=='North Dakota') & (conec['ratio']>1)])
print('-'*50)
# print the summary explicitly: a bare expression in the middle of a cell is not displayed
print(conec[(conec['state']=='North Dakota') & (conec['ratio']>1)]['ratio'].describe())
print('-'*50)
dis[dis['state']=='North Dakota']

Comparing the data given by the sponsor with the data obtained from the website, we see that the sample school is one of these two possibilities.

### Annex II: pct_free/reduced

This annex shows the data from which the table `poverty2` is derived.

Another way to check whether my results for the pct_free/reduced variable were correct is through this table given in LAB. After a series of transformations, the charts show in red where the need to request a lunch cost reduction is greatest and which race is the most disadvantaged.

In [None]:
free = pd.read_csv('../input/elisabeth-1/pct_free.csv', header=None)

In [None]:
free = free.drop([0,1],axis=0)
free = free.drop([3],axis=0)
free = free.drop([1,2,3,4,5,6],axis=1)
free.head()

In [None]:
free.columns = ['Race','Total',
                '0 to 25.0 percent',
                '25.1 to 50.0 percent',
                '50.1 to 75.0 percent',
                'More than 75.0 percent','Missing']
free = free.drop([2,4],axis=0)
free = free.drop(['Total'],axis=1)
free = free.drop([12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29],axis=0)
free = free.drop([62,63,64,65,66],axis=0)
free = free.reset_index()
free = free.drop(['index'],axis=1)
free_total = free.loc[0:7]
free_city = free.loc[8:15]
free_suburn = free.loc[16:23]
free_town = free.loc[24:31]
free_rural = free.loc[32:37]

In [None]:
free_total.head(6)

In [None]:
# Total:
free_total.columns = ['Race', '[0,0.25[', '[0.25,0.5[', '[0.5,0.75[', '[0.75,1[','Missing']
free_total = free_total.drop(['Missing'],axis=1)
free_total= free_total.set_index(['Race'])
free_total = free_total.T
free_total.columns = ['Total','White','Black','Hispanic','Asian','Pacific Islander',
                      'American Indian/Alaska Native','Two or more races']
free_total = free_total.drop(['Asian','Pacific Islander','American Indian/Alaska Native','Two or more races'],axis=1)
#free_total.drop(['Race'],axis=0, inplace=True)

# City:
free_city.columns = ['Race', '[0,0.25[', '[0.25,0.5[', '[0.5,0.75[', '[0.75,1[','Missing']
free_city = free_city.drop(['Missing'],axis=1)
free_city= free_city.set_index(['Race'])
free_city = free_city.T
free_city.columns = ['Total','White','Black','Hispanic','Asian','Pacific Islander',
                      'American Indian/Alaska Native','Two or more races']
free_city = free_city.drop(['Asian','Pacific Islander','American Indian/Alaska Native','Two or more races'],axis=1)
#free_city.drop(['Race'],axis=0, inplace=True)

# Suburb:
free_suburn.columns = ['Race', '[0,0.25[', '[0.25,0.5[', '[0.5,0.75[', '[0.75,1[','Missing']
free_suburn = free_suburn.drop(['Missing'],axis=1)
free_suburn= free_suburn.set_index(['Race'])
free_suburn = free_suburn.T
free_suburn.columns = ['Total','White','Black','Hispanic','Asian','Pacific Islander',
                      'American Indian/Alaska Native','Two or more races']
free_suburn = free_suburn.drop(['Asian','Pacific Islander','American Indian/Alaska Native','Two or more races'],axis=1)
#free_suburn.drop(['Race'],axis=0, inplace=True)

# Town:
free_town.columns = ['Race', '[0,0.25[', '[0.25,0.5[', '[0.5,0.75[', '[0.75,1[','Missing']
free_town = free_town.drop(['Missing'],axis=1)
free_town= free_town.set_index(['Race'])
free_town = free_town.T
free_town.columns = ['Total','White','Black','Hispanic','Asian','Pacific Islander',
                      'American Indian/Alaska Native','Two or more races']
free_town = free_town.drop(['Asian','Pacific Islander','American Indian/Alaska Native','Two or more races'],axis=1)
#free_town.drop(['Race'],axis=0, inplace=True)

# Rural:
free_rural.columns = ['Race', '[0,0.25[', '[0.25,0.5[', '[0.5,0.75[', '[0.75,1[','Missing']
free_rural = free_rural.drop(['Missing'],axis=1)
free_rural= free_rural.set_index(['Race'])
free_rural = free_rural.T
free_rural.columns = ['Total','White','Black','Hispanic','Asian','Pacific Islander']
free_rural = free_rural.drop(['Asian','Pacific Islander'],axis=1)
#free_rural.drop(['Race'],axis=0, inplace=True)

In [None]:
free_total.head()

In [None]:
free_total = free_total.astype('float64')
free_city = free_city.astype('float64')
free_suburn = free_suburn.astype('float64')
free_town = free_town.astype('float64')
free_rural = free_rural.astype('float64')

In [None]:
#_____________________Generic information:
plt.subplot(2,4,1)
free_total['Total'].plot(kind='pie',autopct='%1.1f%%',
                             figsize=(14, 14),
                             colors =("#0eed7a","#b51610","#0bbf63",'#04592d'))
plt.title('Pie chart Total')

plt.subplot(2,4,2)
free_total['White'].plot(kind='pie',autopct='%1.1f%%',
                             figsize=(14, 14),
                             colors =("#daf0e9","#b51610","#5fedc0",'#1fe0a3'))
plt.title('Pie chart Total White')

plt.subplot(2,4,3)
free_total['Black'].plot(kind='pie',autopct='%1.1f%%',
                             figsize=(14, 14),
                             colors =("#d1cdcd","#a19d9d","#a19d9d",'#b51610'))
plt.title('Pie chart Total Black')

plt.subplot(2,4,4)
free_total['Hispanic'].plot(kind='pie',autopct='%1.1f%%',
                             figsize=(14, 14),
                             colors =("#fab216","#c48c12",'#4d3607',"#b51610",'#babd28'))
plt.title('Pie chart Total Hispanic')
plt.show()

# Row 2, City: same four groups (Total, White, Black, Hispanic)
plt.subplot(2,4,1)
free_city['Total'].plot(kind='pie',autopct='%1.1f%%',
                             figsize=(14, 14),
                             colors =("#0eed7a","#0bbf63",'#04592d',"#b51610"))
plt.title('Pie chart Total City')

plt.subplot(2,4,2)
free_city['White'].plot(kind='pie',autopct='%1.1f%%',
                             figsize=(14, 14),
                             colors =("#daf0e9","#b51610","#5fedc0",'#1fe0a3','#babd28'))
plt.title('Pie chart City White')

plt.subplot(2,4,3)
free_city['Black'].plot(kind='pie',autopct='%1.1f%%',
                             figsize=(14, 14),
                             colors =("#d1cdcd","#a19d9d","#a19d9d",'#b51610','#babd28'))
plt.title('Pie chart City Black')

plt.subplot(2,4,4)
free_city['Hispanic'].plot(kind='pie',autopct='%1.1f%%',
                             figsize=(14, 14),
                             colors =("#fab216","#c48c12",'#4d3607',"#b51610",'#babd28'))
plt.title('Pie chart City Hispanic')
plt.show()

# Row 3, Suburb: same four groups (Total, White, Black, Hispanic)
plt.subplot(2,4,1) 
free_suburn['Total'].plot(kind='pie',autopct='%1.1f%%',
                             figsize=(14,14),
                             colors =("#b51610","#0eed7a","#0bbf63",'#04592d','#babd28'))
plt.title('Pie chart Total Suburb')

plt.subplot(2,4,2)
free_suburn['White'].plot(kind='pie',autopct='%1.1f%%',
                             figsize=(14,14),
                             colors =("#b51610","#daf0e9","#5fedc0",'#1fe0a3','#babd28'))
plt.title('Pie chart Suburb White')

plt.subplot(2,4,3)
free_suburn['Black'].plot(kind='pie',autopct='%1.1f%%',
                             figsize=(14,14),
                             colors =("#d1cdcd","#a19d9d",'#b51610',"#a19d9d",'#babd28'))
plt.title('Pie chart Suburb Black')

plt.subplot(2,4,4)
free_suburn['Hispanic'].plot(kind='pie',autopct='%1.1f%%',
                             figsize=(14,14),
                             colors =("#fab216","#c48c12",'#4d3607',"#b51610",'#babd28'))
plt.title('Pie chart Suburb Hispanic')
plt.show()

# Row 4, Town: same four groups (Total, White, Black, Hispanic)
plt.subplot(2,4,1) 
free_town['Total'].plot(kind='pie',autopct='%1.1f%%',
                             figsize=(14,14),
                             colors =("#0eed7a","#0bbf63","#b51610",'#04592d','#babd28'))
plt.title('Pie chart Total Town')

plt.subplot(2,4,2)
free_town['White'].plot(kind='pie',autopct='%1.1f%%',
                             figsize=(14,14),
                             colors =("#daf0e9","#b51610","#5fedc0",'#1fe0a3','#babd28'))
plt.title('Pie chart Town White')

plt.subplot(2,4,3)
free_town['Black'].plot(kind='pie',autopct='%1.1f%%',
                             figsize=(14,14),
                             colors =("#d1cdcd","#a19d9d","#a19d9d",'#b51610','#babd28'))
plt.title('Pie chart Town Black')

plt.subplot(2,4,4)
free_town['Hispanic'].plot(kind='pie',autopct='%1.1f%%',
                             figsize=(14,14),
                             colors =("#fab216","#c48c12","#b51610",'#4d3607','#babd28'))
plt.title('Pie chart Town Hispanic')
plt.show()

# Row 5, Rural: same four groups (Total, White, Black, Hispanic)
plt.subplot(2,4,1) 
free_rural['Total'].plot(kind='pie',autopct='%1.1f%%',
                             figsize=(14,14),
                             colors =("#0eed7a","#b51610","#0bbf63",'#04592d','#babd28'))
plt.title('Pie chart Total Rural')

plt.subplot(2,4,2)
free_rural['White'].plot(kind='pie',autopct='%1.1f%%',
                             figsize=(14,14),
                             colors =("#daf0e9","#b51610","#5fedc0",'#1fe0a3','#babd28'))
plt.title('Pie chart Rural White')

plt.subplot(2,4,3)
free_rural['Black'].plot(kind='pie',autopct='%1.1f%%',
                             figsize=(14,14),
                             colors =("#d1cdcd","#a19d9d","#a19d9d",'#b51610','#babd28'))
plt.title('Pie chart Rural Black')

plt.subplot(2,4,4)
free_rural['Hispanic'].plot(kind='pie',autopct='%1.1f%%',
                             figsize=(14,14),
                             colors =("#fab216","#c48c12","#b51610",'#4d3607','#babd28'))
plt.title('Pie chart Rural Hispanic')
plt.show()


These graphs show, by race and locale, the percentage of students requesting each range of discount on their lunch allowance. From top to bottom, the rows show the overall information provided by Edu Lab (2019-2020), followed by each locale (City, Suburb, Town and Rural). From left to right, the columns show the race of the student (Total, White, Black and Hispanic).
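
The row-by-locale, column-by-race layout described above can also be drawn as a single grid. The following is only a minimal sketch of that layout, not the code used for the figures: the `tables` dictionary and its uniform placeholder values are assumptions made for illustration, and in the notebook they would be replaced by `free_total`, `free_city`, `free_suburn`, `free_town` and `free_rural`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere; omit inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

bins = ['[0,0.25[', '[0.25,0.5[', '[0.5,0.75[', '[0.75,1[']
groups = ['Total', 'White', 'Black', 'Hispanic']
locales = ['Overall', 'City', 'Suburb', 'Town', 'Rural']

# Placeholder tables (uniform values, NOT the Edu Lab figures):
# rows are discount ranges, columns are student groups.
tables = {name: pd.DataFrame(1.0, index=bins, columns=groups) for name in locales}

# One row of pies per locale, one column per group.
fig, axes = plt.subplots(len(locales), len(groups), figsize=(14, 16))
for row, (locale, table) in enumerate(tables.items()):
    for col, group in enumerate(groups):
        ax = axes[row, col]
        table[group].plot(kind='pie', autopct='%1.1f%%', ax=ax)
        ax.set_ylabel('')                  # hide the default column-name label
        ax.set_title(f'{locale} - {group}')
fig.tight_layout()
```

Passing `ax=` explicitly ties each pie to its grid cell, so the position of a chart no longer depends on the order of the `plt.subplot` calls.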

<div style="text-align: justify"> Overall, 28.8% of students ask for a reduction of [0.25,0.5[. Among white students, 37.8% ask for a reduction of [0.25,0.5[, while black and Hispanic students behave similarly to each other: on average, 44.35% of them ask for a reduction of [0.75,1[. This points to a clear economic gap between students of different races. 
Of the 4 locales, City shows the highest poverty, with 41.4% of students asking for a reduction of [0.75,1[. This result is driven mainly by non-white students. New York City, for example, has a high proportion of people of colour. 
It is noteworthy that white students are fairly uniform in their lunch fee reductions across locales, which fall mostly in [0,0.25[ (Suburb) and [0.25,0.5[ (City, Town and Rural). </div>
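
The percentages quoted above are column shares of tables such as `free_total`. As a hedged illustration of that arithmetic, the sketch below converts counts into per-group percentages, averages the black and Hispanic shares in the highest range (presumably how a figure like 44.35% is obtained), and finds the most common range per group. The `counts` values are placeholders chosen for easy checking, not the Edu Lab data.

```python
import pandas as pd

bins = ['[0,0.25[', '[0.25,0.5[', '[0.5,0.75[', '[0.75,1[']

# Placeholder counts (NOT the Edu Lab figures): rows are discount ranges, columns are groups.
counts = pd.DataFrame(
    {'White': [40, 30, 20, 10],
     'Black': [10, 20, 30, 40],
     'Hispanic': [10, 10, 30, 50]},
    index=bins,
)

# Share of each discount range within each group, in percent (each column sums to 100).
shares = counts.div(counts.sum(axis=0), axis=1) * 100

# Average share of black and Hispanic students in the highest discount range.
top_bin_mean = shares.loc['[0.75,1[', ['Black', 'Hispanic']].mean()

# Most common discount range for each group.
most_common = shares.idxmax()
```

With real data, `counts` would be one of the cleaned tables, and `most_common` gives a direct way to check statements such as which range dominates for white students in each locale.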