![](https://images.theconversation.com/files/341551/original/file-20200612-153812-ws3rqu.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=754&fit=clip)

Hi, and welcome to my EDA task on COVID vaccine distribution. For this task we have been asked to answer two questions by exploring the dataset:

1. (A) a list of the top 10 states that distributed the most vaccines total.
2. (B) a list of the top 10 states that distributed the most vaccines per capita -- for the past 7 days.

The dataset is updated quite regularly, so my answers may differ from other submissions.

* [Import Libraries and Load dataset.](#1)
* [Explore Data](#2)
* [Q1 Working Out](#3)
* [Q1 Answer](#4)
* [Q2 Working Out](#5)
* [Q2 Answer](#6)


<a id = "1"></a><br>
# Import the necessary libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

In [None]:
covid = pd.read_csv("../input/usa-covid19-vaccinations/us_state_vaccinations.csv")

<a id = "2"></a><br>
# Explore Dataset

In [None]:
covid.head()

In [None]:
covid.describe()

In [None]:
covid.dtypes

In [None]:
covid.shape

In [None]:
round((covid.isnull().sum()/covid.shape[0])*100,2)

In [None]:
covid.iloc[1]

In [None]:
covid.iloc[2]

In [None]:
covid.iloc[3]

So there's some missing data. There doesn't seem to be much of a pattern behind the rows I've looked at. I've pulled up some rows from Alabama to try to understand what impact the missing data has, and whether there's any rhyme or reason to fill it with something or whether it's fine to leave it out.

The columns with the highest share of missing values are:

1. people_fully_vaccinated_per_hundred: 13.94%
2. people_vaccinated_per_hundred: 12.39%
3. distributed_per_hundred: 12.15%
4. total_vaccinations_per_hundred: 11.89%
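
A ranking like the one above can be produced directly by sorting the missing-value percentages. This is only a minimal sketch: `toy` is a made-up frame standing in for `covid`, and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `covid`; the numbers are made up.
toy = pd.DataFrame({
    "location": ["Alabama", "Alabama", "Alaska", "Alaska"],
    "total_distributed": [100.0, np.nan, 50.0, 60.0],
    "distributed_per_hundred": [np.nan, np.nan, 6.8, np.nan],
})

# isnull().mean() gives the fraction of missing values per column;
# multiplying by 100 and sorting puts the worst columns first.
missing_pct = (toy.isnull().mean() * 100).round(2).sort_values(ascending=False)
print(missing_pct)
```

Sorting in descending order means the columns that need the most attention appear at the top without reading through the whole output.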



In [None]:
covid.columns

In [None]:
covid.groupby(["location"]).sum()

In [None]:
fake_states = covid.location.unique()
fake_states = fake_states.tolist()
fake_states

In [None]:
len(covid.location.unique())

It seems that there are a few more "states" listed than actually apply: according to a quick search, there are 50 US states. Fun fact: Washington DC is actually a federal district. I've put together a list of the actual US states and will create a dataframe from it to continue working with.

https://uk.usembassy.gov/states-of-the-union-states-of-the-u-s/#:~:text=There%20are%20fifty%20(50)%20states,under%20the%20authority%20of%20Congress.

In [None]:
real_states = ["Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", "Delaware", "Florida", "Georgia", "Hawaii", "Idaho", "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky",
               "Louisiana", "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada", "New Hampshire", "New Jersey", "New Mexico", "New York", "North Carolina",
               "North Dakota", "Ohio", "Oklahoma", "Oregon", "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota", "Tennessee", "Texas", "Utah", "Vermont", "Virginia", "Washington", "West Virginia",
               "Wisconsin", "Wyoming"]


if len(real_states) == 50:
    print("You're on the ball, keep going tiger")


In [None]:
coviddf = covid.location.isin(real_states)
filtered = covid[coviddf]
filtered.head()

In [None]:
filtered.location.value_counts()

In [None]:
filtered.date.value_counts()

In [None]:
filtered.date.value_counts().shape[0]

In [None]:
filtered.shape[0] / filtered.date.value_counts().shape[0]

<a id = "3"></a><br>
# Question 1 Working Out

So far this is quite good news. There are 78 unique date stamps, each appearing for 49 locations, so the data is nicely shaped.
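
The count of 49 locations per date, rather than 50, suggests that one name in `real_states` may not match any `location` value exactly. A set difference would show which one. The sketch below uses a shortened, hypothetical state list and made-up location labels (including the "New York State" spelling, which is an assumption for illustration), not the real dataset.

```python
import pandas as pd

# Hypothetical stand-ins: a shortened state list and made-up location labels.
expected_states = ["Alabama", "Alaska", "New York"]
toy = pd.DataFrame({"location": ["Alabama", "Alaska", "New York State", "Guam"]})

present = set(toy["location"].unique())

# Names we expected but that never appear in the data (spelling mismatches).
unmatched = sorted(set(expected_states) - present)
# Labels in the data that are not in our list (territories, agencies, variants).
extra = sorted(present - set(expected_states))

print("Expected but not found:", unmatched)
print("Found but not expected:", extra)
```

If `unmatched` is not empty, the corresponding label in `extra` can be renamed or added to the list before filtering, so that no state is silently dropped from the rankings.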

Time to move on to trying to answer one of the questions.

In [None]:
filtered.head()

In [None]:
QA = filtered[["location", "total_distributed"]]
QA

In [None]:
QA.sort_values(by="total_distributed").head(10)

By using max() in a groupby I will be able to grab the highest, and therefore latest, value in each state's time series. The total is cumulative: if 10 more vaccines are distributed than yesterday, the latest value will always be the highest, since it can't go down.
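
The reasoning that the maximum equals the latest value can be checked on a small example. This is a sketch on an invented toy frame (locations "A" and "B" are placeholders), comparing the per-location maximum with the last non-missing value by date.

```python
import pandas as pd

# Made-up cumulative series for two placeholder locations; one value is missing.
toy = pd.DataFrame({
    "location": ["A", "A", "A", "B", "B", "B"],
    "date": ["2021-04-01", "2021-04-02", "2021-04-03"] * 2,
    "total_distributed": [10.0, 15.0, None, 7.0, 7.0, 9.0],
})

# Highest value per location (missing values are skipped by max()).
by_max = toy.groupby("location")["total_distributed"].max()

# Latest non-missing value per location, ordered by date.
latest = (
    toy.dropna(subset=["total_distributed"])
       .sort_values("date")
       .groupby("location")["total_distributed"]
       .last()
)

# For a cumulative column the two agree; nlargest then gives the top rows.
top = by_max.nlargest(10)
print(by_max.equals(latest))
print(top)
```

The shortcut only holds while the column never decreases; if the source ever revised a total downwards, `max()` and the latest value would disagree.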

In [None]:
Q2 = QA.groupby(by="location").max().sort_values(by="total_distributed",ascending=False)
Q2.head(10)

<a id = "4"></a><br>
# Q1 Answer
Answering the question "(A) a list of the top 10 states that distributed the most vaccines total"
yields this list:
1. California
2. Texas
3. Florida
4. Pennsylvania
5. Illinois
6. Ohio
7. North Carolina
8. Georgia
9. Michigan
10. New Jersey

<a id = "5"></a><br>
# Q2 Working Out
In order to find the top ten states by distribution per capita, I can take the distributed_per_hundred column and divide it by 100 to arrive at the per capita amount.
Realistically, dividing it by 100 or leaving it as it is will yield the same top 10, because the ranking doesn't change. I will however do the division, as it gives a per-person value.
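
A quick way to see that the ranking is unaffected by the division is to sort both versions and compare the order. The values below are invented and the location names are placeholders.

```python
import pandas as pd

# Made-up per-hundred figures for three placeholder locations.
per_hundred = pd.Series(
    {"A": 120.5, "B": 98.2, "C": 131.0}, name="distributed_per_hundred"
)

# Doses per person: divide the per-hundred figure by 100.
per_capita = per_hundred / 100

# Dividing by a positive constant keeps the order intact.
rank_hundred = per_hundred.sort_values(ascending=False).index.tolist()
rank_capita = per_capita.sort_values(ascending=False).index.tolist()
print(rank_hundred == rank_capita)
```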

In [None]:
filtered.head()

In [None]:
QB = filtered[["location", "distributed_per_hundred"]]
QB

In [None]:
Q3 = QB.groupby(by="location").max().sort_values(by="distributed_per_hundred",ascending=False)
Q3.head(10)

Answering the question "(B) a list of the top 10 states that distributed the most vaccines per capita -- for the past 7 days" wouldn't yield the list above.

I have missed the significant part: "past 7 days". This list is still interesting, and I will be able to compare it to the final answer once I have removed all dates more than 7 days old from the dataset.

In [None]:
filtered.date.tail(8)

In [None]:
moo = filtered[(filtered['date'] > "2021-04-09") ]
moo.shape

In [None]:
Q4 = moo[["location", "distributed_per_hundred"]]
Q4 = Q4.groupby(by="location").max().sort_values(by="distributed_per_hundred",ascending=False)
Q4.head(10)

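Strangely enough, the two lists are exactly the same. That makes sense: the column is cumulative, so its maximum always falls within the most recent days. It didn't take me long to realise what I need to do in order to make it right.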

I need to create a separate dataframe with the max values from before 7 days ago. I could then combine it with Q4 and subtract one from the other to find out who has distributed the most in the last 7 days.
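
The subtraction idea can be sketched on a toy frame as follows. Everything here is hypothetical: the locations, the values, and the `cutoff` date. One assumption worth flagging is the boundary: the baseline below uses `<=` so that the cutoff date itself counts as the starting point. A strict `<` would take the baseline from one day earlier and so cover an extra day.

```python
import pandas as pd

# Made-up cumulative per-hundred figures for two placeholder locations.
toy = pd.DataFrame({
    "location": ["A"] * 3 + ["B"] * 3,
    "date": ["2021-04-08", "2021-04-09", "2021-04-16"] * 2,
    "distributed_per_hundred": [50.0, 52.0, 60.0, 70.0, 71.0, 74.0],
})
cutoff = "2021-04-09"  # hypothetical boundary: 7 days before the latest date

# ISO-formatted date strings compare correctly as plain strings.
recent = toy[toy["date"] > cutoff].groupby("location")["distributed_per_hundred"].max()
baseline = toy[toy["date"] <= cutoff].groupby("location")["distributed_per_hundred"].max()

# Both Series are indexed by location, so subtraction aligns them by state.
last_7_days = (recent - baseline).sort_values(ascending=False)
print(last_7_days.head(10))
```

In this toy data "B" has the higher cumulative figure, but "A" distributed more during the window, which is the kind of reordering the difference is meant to expose.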



In [None]:
foo = filtered[(filtered['date'] < "2021-04-09") ]
foo.shape

In [None]:
foo.date

In [None]:
Q5 = foo[["location", "distributed_per_hundred"]]
Q5 = Q5.groupby(by="location").max().sort_values(by="distributed_per_hundred",ascending=False)
Q5.head(10)

In [None]:
bar = Q4 - Q5
bar.head()

In [None]:
bar.sort_values(by="distributed_per_hundred", ascending=False).head(10)

So the final list is in. I'll make a table to compare the two.

| Wrong List     | Right List     |
| -------------- | -------------- |
| Alaska         | Connecticut    |
| Connecticut    | Vermont        |
| Vermont        | Rhode Island   |
| South Dakota   | Massachusetts  |
| New Mexico     | California     |
| Hawaii         | Maryland       |
| Oklahoma       | Maine          |
| Massachusetts  | Delaware       |
| Maine          | New Hampshire  |
| Rhode Island   | New Mexico     |

<a id = "6"></a><br>
# Q2 Answer
So to answer the second question, "(B) a list of the top 10 states that distributed the most vaccines per capita -- for the past 7 days", the top 10 states are as follows:
1. Connecticut
2. Vermont
3. Rhode Island
4. Massachusetts
5. California
6. Maryland
7. Maine
8. Delaware
9. New Hampshire
10. New Mexico




Thanks for reading through my task submission. I've learned quite a few things about notebooks from this exercise. If anyone notices anything wrong or untoward, please leave a comment and I'll do my best to reply and fix my mistake.