# Rachel Tjarksen EDA 
EDA Tasks for Rachel Tjarksen

You have been provided with two abridged datasets from FiveThirtyEight’s COVID-19 tracker. The datasets contain a list of opinion surveys fielded in the United States.  
- Dataset1 has opinion polls on the Trump administration’s handling of the coronavirus pandemic.  
- Dataset2 has opinion polls measuring respondents’ concerns, related either to the economy or infection from COVID-19.  

## Imports

In [56]:
# Add imports here
import numpy as np
import pandas


## Data Extraction

In [76]:
# Data Extraction Code from the csv files
concern_data = pandas.read_csv("covid_conern.csv")
approval_data = pandas.read_csv("covid_approval.csv")
approval_data

Unnamed: 0.1,Unnamed: 0,start_date,end_date,pollster,sponsor,sample_size,population,party,subject,tracking,text,approve,disapprove,url
0,1,2020-02-02,2020-02-04,YouGov,Economist,1500.0,a,all,Trump,False,Do you approve or disapprove of Donald Trump’s...,42.0,29.0,https://d25d2506sfb94s.cloudfront.net/cumulus_...
1,2,2020-02-02,2020-02-04,YouGov,Economist,376.0,a,R,Trump,False,Do you approve or disapprove of Donald Trump’s...,75.0,6.0,https://d25d2506sfb94s.cloudfront.net/cumulus_...
2,3,2020-02-02,2020-02-04,YouGov,Economist,523.0,a,D,Trump,False,Do you approve or disapprove of Donald Trump’s...,21.0,51.0,https://d25d2506sfb94s.cloudfront.net/cumulus_...
3,4,2020-02-02,2020-02-04,YouGov,Economist,599.0,a,I,Trump,False,Do you approve or disapprove of Donald Trump’s...,39.0,25.0,https://d25d2506sfb94s.cloudfront.net/cumulus_...
4,5,2020-02-07,2020-02-09,Morning Consult,,2200.0,a,all,Trump,False,Do you approve or disapprove of the job each o...,57.0,22.0,https://morningconsult.com/wp-content/uploads/...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2213,2214,2021-01-16,2021-01-19,American Research Group,,332.0,a,I,Trump,False,Do you approve or disapprove of the way Donald...,31.0,68.0,https://americanresearchgroup.com/economy/
2214,2215,2021-01-16,2021-01-19,YouGov,Economist,1500.0,a,all,Trump,False,Do you approve or disapprove of the way Donald...,40.0,53.0,https://docs.cdn.yougov.com/4k61xul7y7/econTab...
2215,2216,2021-01-16,2021-01-19,YouGov,Economist,293.0,a,R,Trump,False,Do you approve or disapprove of the way Donald...,80.0,16.0,https://docs.cdn.yougov.com/4k61xul7y7/econTab...
2216,2217,2021-01-16,2021-01-19,YouGov,Economist,601.0,a,D,Trump,False,Do you approve or disapprove of the way Donald...,11.0,87.0,https://docs.cdn.yougov.com/4k61xul7y7/econTab...


## Task 1 (Data Preprocessing)

Your task is to familiarise yourself with the datasets, perform any necessary cleaning, and combine them into one dataset.   

Before combining the datasets, here are a few things to keep in mind:  
1. Since we are reporting aggregate statistics, we need to be able to weight the responses from each poll to generate a weighted average. This means that we should disregard polls that do not report sample sizes or that have missing values in the response options. By response options we mean the variables that capture responses to the opinion polls.  
2. For the purpose of this exercise, let us assume that our previous data indicates that a survey that has lower than 200 responses will be statistically underpowered, so please report results only for eligible studies.  
3. Please ensure that the final dataset includes indicator variables that clearly mark which of the two original datasets the row originated from.

In [58]:
# Preprocessing for Dataset 2 (concern polls)
# Keep polls with at least 200 respondents, then drop rows with missing response options
concern_text = concern_data["text"].dropna().to_numpy().reshape(-1, 1)
concern_filt = concern_data[concern_data["sample_size"] >= 200]
concern_y = concern_filt[["very", "somewhat", "not_very", "not_at_all"]].dropna().to_numpy()

In [59]:
# Preprocessing for Dataset 1 (approval polls)
# Keep polls with at least 200 respondents, then drop rows with missing response options
approval_text = approval_data["text"].dropna().to_numpy().reshape(-1, 1)
approval_filt = approval_data[approval_data["sample_size"] >= 200]
approval_y = approval_filt[["approve", "disapprove"]].dropna().to_numpy()
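
The three requirements listed for Task 1 (drop polls with missing sample sizes or responses, drop polls with fewer than 200 respondents, and mark each row's origin) could be sketched as one cleaning helper followed by a concatenation. This is a minimal sketch rather than the intended solution: the helper name `clean_polls`, the indicator column names `is_approval` and `is_concern`, and the toy frames are assumptions for illustration, while the response column names come from the datasets above.

```python
import numpy as np
import pandas as pd

MIN_SAMPLE_SIZE = 200  # threshold stated in the task description


def clean_polls(df, response_cols, min_n=MIN_SAMPLE_SIZE):
    """Keep polls with a reported sample size >= min_n and complete responses."""
    kept = df.dropna(subset=["sample_size"] + list(response_cols))
    return kept[kept["sample_size"] >= min_n].copy()


# Toy stand-ins for approval_data and concern_data (values are made up).
approval_toy = pd.DataFrame({
    "pollster": ["A", "B", "C"],
    "sample_size": [1500.0, 150.0, np.nan],
    "approve": [42.0, 50.0, 40.0],
    "disapprove": [29.0, 45.0, 55.0],
})
concern_toy = pd.DataFrame({
    "pollster": ["A", "B"],
    "sample_size": [1000.0, 800.0],
    "very": [19.0, np.nan],
    "somewhat": [33.0, 30.0],
    "not_very": [23.0, 20.0],
    "not_at_all": [11.0, 10.0],
})

approval_clean = clean_polls(approval_toy, ["approve", "disapprove"])
concern_clean = clean_polls(concern_toy, ["very", "somewhat", "not_very", "not_at_all"])

# Indicator columns (hypothetical names) record which dataset each row came from.
approval_clean["is_approval"] = 1
approval_clean["is_concern"] = 0
concern_clean["is_approval"] = 0
concern_clean["is_concern"] = 1

# Columns that exist in only one dataset become NaN in the other dataset's rows.
combined = pd.concat([approval_clean, concern_clean], ignore_index=True)
```

Filtering before concatenating keeps each dataset's own response columns in charge of its missing-value check, so approval rows are not dropped for lacking `very` or `somewhat`.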

## Task 2 (Simple Data Representation)

Following the completion of Task 1, please do the following subtasks:  
1. A summary table to display the frequency of approval polls and concern polls conducted by each pollster.    
2. Aggregated proportions of respondents approving of Donald Trump’s performance, delineated by whether the poll was asked to Democrats, Republicans or Independents.    
3. Aggregated proportions of respondents who are very worried or somewhat worried about the economy.  
4. Aggregated proportions of respondents who are very worried or somewhat worried about infection from the coronavirus.  

In [60]:
# Code for Task 2.1
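
One way to build the pollster frequency table for subtask 1 is a cross-tabulation. The sketch below assumes the combined dataset carries a label column, here hypothetically named `poll_type`, that distinguishes approval polls from concern polls; the rows are toy data, not poll results.

```python
import pandas as pd

# Toy combined frame (made-up rows); `poll_type` is a hypothetical label column
# marking whether a row came from the approval or the concern dataset.
combined = pd.DataFrame({
    "pollster": ["YouGov", "YouGov", "Morning Consult", "YouGov", "Morning Consult"],
    "poll_type": ["approval", "approval", "approval", "concern", "concern"],
})

# Rows: pollsters; columns: poll type; cells: number of polls.
pollster_counts = pd.crosstab(combined["pollster"], combined["poll_type"])
print(pollster_counts)
```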

In [61]:
# Code for Task 2.2
# Trump approval polls, keeping the columns needed to aggregate approval by party
trump_filt = approval_data.groupby("subject").get_group("Trump")[["party", "sample_size", "approve"]]
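
The per-party aggregate for subtask 2 can be computed as a sample-size-weighted mean of `approve`. The following is a sketch under stated assumptions: the function name `weighted_approval` is hypothetical, the frame holds toy values, and only the `party`, `sample_size`, and `approve` columns of the approval dataset are used.

```python
import pandas as pd

# Toy approval polls (made-up values).
polls = pd.DataFrame({
    "party": ["R", "R", "D", "D", "I", "all"],
    "sample_size": [400.0, 600.0, 500.0, 500.0, 300.0, 1500.0],
    "approve": [80.0, 70.0, 20.0, 10.0, 40.0, 45.0],
})


def weighted_approval(df):
    """Sample-size-weighted mean of `approve` for each party."""
    weighted_sum = (df["approve"] * df["sample_size"]).groupby(df["party"]).sum()
    total_weight = df["sample_size"].groupby(df["party"]).sum()
    return weighted_sum / total_weight


# Restrict to polls asked of Democrats, Republicans, or Independents.
partisan = polls[polls["party"].isin(["D", "R", "I"])]
approval_by_party = weighted_approval(partisan)
```

Dividing the weighted sum by the total weight within each group gives the same result as calling `numpy.average` with `weights=` on each party's rows.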

In [77]:
# Code for Task 2.3
# use groupby in pandas
# use weighted means in numpy

econ_filt = concern_data.groupby("subject").get_group("concern-economy")[["very", "somewhat"]]
econ_filt

Unnamed: 0,very,somewhat
0,19.0,33.0
1,26.0,32.0
3,23.0,32.0
7,22.0,35.0
11,32.0,37.0
...,...,...
618,61.0,26.0
619,53.0,33.0
620,59.0,28.0
622,58.0,27.0


In [78]:
# Code for  Task 2.4
infection_filt = concern_data.groupby("subject").get_group("concern-infected")[["very", "somewhat"]]
infection_filt

Unnamed: 0,very,somewhat
2,13.00,26.00
4,11.00,24.00
5,11.00,28.00
6,22.00,23.00
8,22.00,21.00
...,...,...
625,33.09,36.55
626,30.00,31.00
627,34.00,35.00
628,28.00,32.00


## Task 3 (Graphical Analysis)

Using a plotting library of your choice, plot the graphs that depict the following:  
  - **Change in Approval Rating**: line graphs of the change in approval rating over time, separated by party.  
  - **Change in Economic Concern**: line graphs of the change in level of economic concern over time, separated by level of concern.  
  - **Change in Covid-19 Concern**: line graphs of the change in level of Covid-19 concern over time, separated by level of concern.  
  - **Aggregate Approval Rating**: A histogram of the weighted-average (aggregated) approval rating by party.  
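
As an illustration of the first graph, the sketch below pivots polls into one column per party and draws one line per column. It assumes matplotlib as the plotting library (the task leaves the choice open) and uses toy values; the `end_date`, `party`, and `approve` column names come from the approval dataset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy approval polls (made-up values).
polls = pd.DataFrame({
    "end_date": pd.to_datetime(["2020-02-04", "2020-02-04", "2020-03-04", "2020-03-04"]),
    "party": ["D", "R", "D", "R"],
    "approve": [20.0, 70.0, 10.0, 80.0],
})

# One column per party, one row per poll end date (mean approval on that date).
by_party = polls.pivot_table(index="end_date", columns="party", values="approve", aggfunc="mean")

fig, ax = plt.subplots()
for party in by_party.columns:
    ax.plot(by_party.index, by_party[party], marker="o", label=party)
ax.set_xlabel("Poll end date")
ax.set_ylabel("Approval (%)")
ax.set_title("Change in approval rating by party")
ax.legend(title="Party")
fig.autofmt_xdate()
```

The same pivot-then-plot pattern would apply to the concern graphs by pivoting on the level of concern instead of party.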

## Task 4 (Data Analysis)

Using scikit-learn, train models that can do the following tasks:  
  - **Approval Rating**: Train a model that can take a date and party affiliation to predict what the approval rating would be.  
  - **Covid and Economic Concern**: Train a model that can predict the concern breakdown using a given time of poll. 
  - **Party Affiliation**: Given different samples of polling data, train a model that can predict whether a group of people with those preferences is Democratic, Republican, or Independent.  
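
For the approval-rating model, one possible starting point is a linear regression on a numeric date feature plus a one-hot encoded party. This is only a sketch under assumptions: encoding the date as days since the first poll, the feature name `days`, and the choice of `LinearRegression` are illustrative, and the training rows are toy values.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy approval polls (made-up values).
polls = pd.DataFrame({
    "end_date": pd.to_datetime(["2020-02-04", "2020-02-04", "2020-06-04", "2020-06-04"]),
    "party": ["D", "R", "D", "R"],
    "approve": [20.0, 70.0, 10.0, 80.0],
})

# Encode the date as days since the first poll so the model sees a number.
first_day = polls["end_date"].min()
features = pd.DataFrame({
    "days": (polls["end_date"] - first_day).dt.days,
    "party": polls["party"],
})

# One-hot encode party; pass the numeric day count through unchanged.
preprocess = ColumnTransformer(
    [("party", OneHotEncoder(handle_unknown="ignore"), ["party"])],
    remainder="passthrough",
)
model = make_pipeline(preprocess, LinearRegression())
model.fit(features, polls["approve"])

# Predict approval among Republicans 60 days after the first poll.
query = pd.DataFrame({"days": [60], "party": ["R"]})
predicted = model.predict(query)
```

A single shared slope across parties is a strong simplification; a model with party-specific trends, or a non-linear regressor, may fit the real data better.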
 

## Task 5 (Extra Task)