# Data exploration 

---

Group name: Group G

---


## Introduction

The data set we chose for this exercise is called "covid_approval_polls". It covers the opinions of different groups in America on Trump's and Biden's handling of the coronavirus outbreak. The question is whether respondents approve or disapprove of that handling. The samples are grouped by several categories, for example party affiliation or whether the respondents are registered voters.

## Setup

In [2]:
import pandas as pd
import altair as alt
import numpy as np

alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

## Data

## Import data

In [3]:
df = pd.read_csv("../data/external/data.csv", on_bad_lines='skip')

#### CHART 1

### Data structure

In [4]:
df['party'] = df['party'].astype("category")
df['sample_size'] = df['sample_size'].astype("category")

### Data corrections

## Exploratory data analysis

In [5]:
chart = alt.Chart(df).mark_bar(size = 40).encode(
    x=alt.X('party',
            sort='-y',
            axis=alt.Axis(title="Party", 
                          titleAnchor="middle", 
                          labelAngle=0)),
    y=alt.Y('count(party)', 
            axis=alt.Axis(title = "Count", 
                          titleAnchor="middle")),
    color=alt.Color('party', scale=alt.Scale(scheme='pastel2'))
).properties(
    title='Count of party members questioned ',
    width=350,
    height=250
).configure_title(
    fontSize=15,
    font='Arial',
    anchor='middle',
    color='black'
)

chart


We decided to use a simple bar plot here because the information shown in the chart is also rather simple. We also considered a pie chart but decided against it because a bar plot makes it easier to see that "Republicans", "Democrats" and "Independent" were questioned pretty much equally while "All" has a slightly higher count. (The legend could technically be removed here because the information is already readable in the chart. We kept it out of personal preference.)
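The "pretty much equally" reading can also be checked numerically instead of by eye. The following is a minimal sketch, assuming a `party` column like the one used above; the small `toy` frame and its party codes are made-up stand-ins for `df`, not the real poll data.

```python
import pandas as pd

# Toy stand-in for df (hypothetical values, not the real poll data)
toy = pd.DataFrame({"party": ["all", "all", "all", "D", "D", "R", "R", "I", "I"]})

# Absolute count and relative share of rows per party
party_counts = toy["party"].value_counts()
party_shares = toy["party"].value_counts(normalize=True).round(3)

print(party_counts)
print(party_shares)
```

Running the same two `value_counts` calls on `df["party"]` would give the numbers behind the bar heights.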

#### CHART 2 (Interactive)

### Data structure

In [6]:
df['population'] = df['population'].astype("category")

## Exploratory data analysis

In [7]:
source = pd.DataFrame(df.population.value_counts())

In [8]:
source = source.reset_index()

In [9]:
source.rename(columns={"index": "Population", "population": "count"}, inplace=True)

In [10]:
source

|   | Population | count |
|---|------------|-------|
| 0 | a          | 2308  |
| 1 | rv         | 1396  |
| 2 | lv         | 162   |
| 3 | v          | 1     |

In [11]:
chart = alt.Chart(source).mark_arc(innerRadius=30).encode(
    theta=alt.Theta("count:Q", stack=True), 
    color=alt.Color("Population:N"),
    tooltip=["count", "Population"]
).properties(
    height=300, width=300,
    title="Questioned Population"
)


pie = chart.mark_arc(outerRadius=120)
legend = chart.mark_text(radius=140, size=20).encode(text="Population:N")



pie + legend


We decided to use a pie chart rather than another bar plot because, although the information is similar to the first chart, the differences between the values are much greater, which makes them easier to see in a pie chart. Here you can see that most of the people questioned were adults (over 50%), a small part were likely voters and the rest were registered voters. (The legend could technically be removed here because the information is already readable in the chart. We kept it out of personal preference.)
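The "over 50%" statement can be verified from the counts in the `source` table. This is a small sketch that re-enters those four counts by hand; the `share_pct` column name is our own choice for illustration.

```python
import pandas as pd

# Counts copied from the `source` table shown earlier
source = pd.DataFrame(
    {"Population": ["a", "rv", "lv", "v"], "count": [2308, 1396, 162, 1]}
)

# Percentage of all rows that falls into each population group
source["share_pct"] = (100 * source["count"] / source["count"].sum()).round(1)

print(source)
```

The single `v` row is practically invisible in the pie chart, which is one reason to keep a table like this next to it.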

#### CHART 3 (Interactive)

### Data structure

In [12]:
df.subject = df.subject.astype("category")
df.party = df.party.astype("category")

## Exploratory data analysis

In [78]:
chart = alt.Chart(df).mark_bar(size = 60).encode(
    x=alt.X('subject',
            sort='-y',
            axis=alt.Axis(title="Subject", 
                          titleAnchor="middle", 
                          labelAngle=0)),
    y=alt.Y('count(subject)', 
            axis=alt.Axis(title = "Count", 
                          titleAnchor="middle")),
    color= alt.Color('party', 
                     legend=alt.Legend(title="Party"), scale=alt.Scale(scheme='tableau20')),
    tooltip=["subject", "party", "count(subject)"]
).interactive(
).properties(
    title='How many people of each party were questioned about Trump and Biden?',
    width=350,
    height=250
).configure_title(
    fontSize=15,
    font='Arial',
    anchor='middle',
    color='orange'
)

chart

For this chart we decided to use a stacked bar plot because we thought it would be a good way to see whether the party breakdown of the people questioned differs between Trump and Biden. It's interesting because we thought that more Republicans might have been questioned about Trump, but as the plot shows, the parties are split pretty much equally. This suggests that the survey isn't biased in this respect, because all parties give their opinion on both subjects equally.
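The equal split can be expressed as a table of party shares within each subject. Below is a minimal sketch using `pd.crosstab`, assuming `subject` and `party` columns as in `df`; the `toy` rows are hypothetical and only illustrate the shape of the result.

```python
import pandas as pd

# Toy stand-in for df (hypothetical rows, not the real poll data)
toy = pd.DataFrame({
    "subject": ["Trump", "Trump", "Trump", "Trump",
                "Biden", "Biden", "Biden", "Biden"],
    "party": ["all", "D", "R", "I", "all", "D", "R", "I"],
})

# Row-normalised table: share of each party within each subject
split = pd.crosstab(toy["subject"], toy["party"], normalize="index")

print(split.round(2))
```

If the rows for Trump and Biden look alike on the real data, the party composition is the same for both subjects. Note that this only describes how the rows are distributed; it says nothing about how each poll selected its respondents.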

#### CHART 4 (Interactive)

### Data structure

In [19]:
df.pollster = df.pollster.astype("category")
df.subject = df.subject.astype("category")

## Exploratory data analysis

In [85]:
alt.Chart(df).mark_bar(
    cornerRadiusTopLeft=3,
    cornerRadiusTopRight=3,
    size=10
).encode(
    x=alt.X('pollster:O',
            axis=alt.Axis(title="Pollster", 
                          titleAnchor="middle")),
    y=alt.Y('count(subject):Q', 
            axis=alt.Axis(title = "Count(Subject)", 
                          titleAnchor="middle")),
    color=alt.Color('subject', 
                     legend=alt.Legend(title="Subject"), scale=alt.Scale(scheme='dark2')),
    tooltip=[ "pollster", "subject", "count(subject)"]
).interactive(
).properties(
    title='How many polls did each pollster carry out for Trump and Biden?',
    width=900,
    height=550
).configure_title(
    fontSize=25,
    font='Times New Roman',
    anchor='middle',
    color='orange'
)