# Introduction

**Hello all. Thank you for viewing my notebook! I am a rookie as of now when it comes to Data Science. My code might not be the cleanest, so I try to explain my thought process as I go along in this notebook**

**If you find this notebook useful, feel free to give me an upvote. Also, if you have any suggestions on how I can improve my way of thinking or code, please do share!**

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import cufflinks as cf
import plotly.express as px
import plotly.graph_objects as go
import numpy as np
init_notebook_mode(connected=True)
cf.go_offline()

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


# Import Data

**Here we will import the data into a DataFrame**

In [None]:
df = pd.read_csv('/kaggle/input/covid-world-vaccination-progress/country_vaccinations.csv')

In [None]:
df

**Our goal is to track the progress of the COVID-19 Vaccination campaign across the globe.**

**We can use the `.groupby()` method on the `df` DataFrame to group the data by a given column. For example, it might be worthwhile to group the data by country or by date.**

In [None]:
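# numeric_only=True keeps only the numeric columns, so text columns are not summed
country_df = df.groupby('country').sum(numeric_only=True)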
country_df.reset_index(inplace=True)
country_df

In [None]:
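# numeric_only=True keeps only the numeric columns, so text columns are not summed
date_df = df.groupby('date').sum(numeric_only=True)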
date_df.reset_index(inplace=True)
date_df

**We can see that when we use the `.groupby()` method with `.sum()` we only get the columns that contained numerical data back, which makes sense.**

**The country_df DataFrame will help us see all the data from each country, summed over every day on which data was tracked.**

**The date_df DataFrame will help us see all the data from each date for everybody in the world, since we grouped by date and not by country.**
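
**To make the difference between the two groupings concrete, here is a minimal sketch on a made-up DataFrame (the `toy` data and the names `by_country` and `by_date` are only for illustration, they are not part of the real dataset).**

```python
import pandas as pd

# Hypothetical toy data: two countries reporting over two days
toy = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B'],
    'date': ['2021-01-01', '2021-01-02', '2021-01-01', '2021-01-02'],
    'daily_vaccinations': [10, 20, 5, 15],
})

# One row per country: values are summed over all dates for that country
by_country = toy.groupby('country').sum(numeric_only=True).reset_index()

# One row per date: values are summed over all countries on that date
by_date = toy.groupby('date').sum(numeric_only=True).reset_index()

print(by_country)
print(by_date)
```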

# Cumulative Data Visualization

**Below, we graph the cumulative number of vaccinations distributed in the world on a given date, the number of people in the world who have been vaccinated, and the number of people who were fully vaccinated on a given day.**

In [None]:
fig = go.Figure(data=[
    go.Scatter(
        x = date_df['date'],
        y = date_df['people_vaccinated'],
        name = 'Total People Vaccinated'
    ),
    go.Scatter(
        x = date_df['date'],
        y = date_df['people_fully_vaccinated'],
        name = 'People Fully Vaccinated'
    ),
    go.Scatter(
        x = date_df['date'],
        y = date_df['total_vaccinations'],
        name = 'Total Vaccinations Distributed'
    )

])

fig.update_layout(title = 'Vaccines Distributed and Administered',
                 xaxis_title = 'Date',
                 yaxis_title = 'Count')
fig.show()

**Hmmm... Since the columns *people_vaccinated*, *people_fully_vaccinated* and *total_vaccinations* hold cumulative data, we expect the lines to keep increasing or stay flat, never decline!**
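
**A small made-up example shows how a dip like this can appear. This is only a sketch with hypothetical numbers: country B skips one day of reporting, `.sum()` ignores the NaN, and the world total drops even though nobody was "un-vaccinated".**

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: country B does not report on the second day
toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B', 'B', 'B'],
    'date': ['2021-01-01', '2021-01-02', '2021-01-03'] * 2,
    'people_vaccinated': [100, 120, 200, 50, np.nan, 90],
})

# sum() skips NaN, so the total for 2021-01-02 is missing B's 50 people
world = toy.groupby('date')['people_vaccinated'].sum()
print(world)

# A cumulative total should never decrease, but this one does
print(world.is_monotonic_increasing)
```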

**Below, we can see that there is missing data for these three columns. We should fill in this data.**

**To do so, we will replace the missing data with the data from the previous date. For this to work, we need the same dates for every country.**

In [None]:
df.isnull().sum()

**This is clearly skewing our data. Let's try to fill in those NaN values. If a country reports NaN for the people vaccinated in their country on a given day, we should use the previous day's reported value. As a first step, we replace each NaN with 0 so the gaps are easy to find later.**

In [None]:
'''
This function takes a numeric value and returns 0 if the value is missing (NaN).
If the value isn't missing, it is returned unchanged.
'''

def change_nan(value):

    if np.isnan(value):
        return 0
    else:
        return value

df['people_vaccinated'] = df['people_vaccinated'].apply(change_nan)
df['total_vaccinations'] = df['total_vaccinations'].apply(change_nan)
df['people_fully_vaccinated'] = df['people_fully_vaccinated'].apply(change_nan)

**To get the whole range of dates in the DataFrame, we will use the built-in pandas function `pd.date_range()`**

In [None]:
print('Earliest Date in Data: ', df['date'].min())
print('Latest Date in Data: ', df['date'].max())
date_range = pd.date_range(df['date'].min(), df['date'].max())
date_range
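
**As a quick illustration on made-up dates (the `dates` Series below is hypothetical), `pd.date_range()` returns every calendar day between the two endpoints, including days that never appear in the data. Taking the min and max of the strings works here because dates in YYYY-MM-DD format sort chronologically.**

```python
import pandas as pd

# Hypothetical date strings in the same YYYY-MM-DD format as the 'date' column
dates = pd.Series(['2021-01-03', '2021-01-01', '2021-01-05'])

# date_range fills in the days that are missing between the endpoints
full_range = pd.date_range(dates.min(), dates.max())

print(len(full_range))
print([str(d.date()) for d in full_range])
```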

In [None]:
# Create a new DataFrame, df1
df1 = pd.DataFrame(columns=df.columns)

# x will be the index location for the new DataFrame, df1. 
# The size of df1 will always be len(date_range) * len(df.country.unique())
x = 0

# df_index will be used to grab the index location of df, the DataFrame with raw data
df_index = 0

# Iterates through every unique country in df
for country in list(df.country.unique()):

    # Iterates through the length of date_range
    for date in range(0, len(date_range)):
        
        # If the date (from date_range) is in the df DataFrame (.iloc[:,2] will grab the df date),
        # we will add the data that is in df, into df1
        
        if str(date_range[date].date()) in list(df[df['country'] == country].iloc[:,2].values):

            df1.loc[x] = [country, df.iloc[df_index].iso_code, df.iloc[df_index].date,
                          df.iloc[df_index].total_vaccinations,df.iloc[df_index].people_vaccinated,df.iloc[df_index].people_fully_vaccinated,
                          df.iloc[df_index].daily_vaccinations_raw,df.iloc[df_index].daily_vaccinations, df.iloc[df_index].total_vaccinations_per_hundred,
                          df.iloc[df_index].people_vaccinated_per_hundred, df.iloc[df_index].people_fully_vaccinated_per_hundred, df.iloc[df_index].daily_vaccinations_per_million,
                          df.iloc[df_index].vaccines, df.iloc[df_index].source_name, df.iloc[df_index].source_website]
            df_index += 1
            
        # If the date (from date_range) is NOT in the df DataFrame (.iloc[:,2] will grab the df date),
        # we will add country info and the date that was missing from df into new df1. Zeros will be added for the other columns
        else:

            df1.loc[x] = [country,0, str(date_range[date].date()),0,0,0,0,0,0,0,0,0,0,0,0]
                
        x+=1
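
**The loop above can take a while on the full dataset. One possible alternative, shown here only as a sketch on made-up data, is to build the full grid of (country, date) pairs with `pd.MultiIndex.from_product()` and reindex onto it. The `toy` DataFrame and the names `all_dates`, `grid` and `full` are assumptions for illustration, and the sketch only carries one numeric column, whereas the loop also zero-fills the text columns.**

```python
import pandas as pd

# Hypothetical toy data: country B has no row for 2021-01-02
toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B', 'B'],
    'date': ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-01', '2021-01-03'],
    'people_vaccinated': [100, 120, 200, 50, 90],
})

# Every calendar day between the earliest and latest date, as YYYY-MM-DD strings
all_dates = pd.date_range(toy['date'].min(), toy['date'].max()).strftime('%Y-%m-%d')

# Full grid of (country, date) pairs
grid = pd.MultiIndex.from_product(
    [toy['country'].unique(), all_dates], names=['country', 'date']
)

# Reindex onto the grid; pairs that were missing get 0, like the zero-filled rows in the loop
full = (
    toy.set_index(['country', 'date'])
       .reindex(grid, fill_value=0)
       .reset_index()
)
print(full)
```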

In [None]:
df1

In [None]:
# Loops through all of the unique countries in the new DataFrame, df1

for country in list(df1.country.unique()):
    
    # Grabs all of the indexes for the given country the above loop is on
    index = df1[df1['country'] == country].index
    
    start = True
    
    for i in index:
        
        # If we are on the first index for a given country, there is no previous data to grab if entry is zero,
        # so we will ignore this case 
        if start == True:
            
            start = False
            
            continue
        
        # If people_vaccinated at given index location is 0, then grab the previous index people_vaccinated data
        if int(df1.loc[i, 'people_vaccinated']) == 0:
            
            df1.loc[i, 'people_vaccinated'] = df1.loc[i-1, 'people_vaccinated']
            
        # If total_vaccinations at given index location is 0, then grab the previous index total_vaccinations data            
        if int(df1.loc[i, 'total_vaccinations']) == 0:
            
            df1.loc[i, 'total_vaccinations'] = df1.loc[i-1, 'total_vaccinations']
            
        # If people_fully_vaccinated at given index location is 0, then grab the previous index people_fully_vaccinated data    
        if int(df1.loc[i, 'people_fully_vaccinated']) == 0:
            
            df1.loc[i, 'people_fully_vaccinated'] = df1.loc[i-1, 'people_fully_vaccinated']
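
**The same "use the previous value" idea can probably be written without an explicit loop. The sketch below, on made-up data, treats 0 as missing, forward-fills within each country with `.groupby().ffill()`, and puts 0 back for days before a country's first report. The `toy` DataFrame and the names `cols` and `filled` are hypothetical.**

```python
import numpy as np
import pandas as pd

# Hypothetical toy data in the same shape as df1: 0 marks a missing report
toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B', 'B', 'B'],
    'date': ['2021-01-01', '2021-01-02', '2021-01-03'] * 2,
    'people_vaccinated': [0, 120, 0, 50, 0, 90],
})

cols = ['people_vaccinated']

# Treat 0 as missing, carry the last reported value forward within each country,
# then restore 0 where a country has not reported anything yet
filled = toy[cols].replace(0, np.nan).groupby(toy['country']).ffill().fillna(0)
toy[cols] = filled

print(toy)
```

**Grouping by country matters here: without it, the last value of one country would leak into the first rows of the next one.**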


**Now we can see that there is no missing data in the columns *people_vaccinated*, *people_fully_vaccinated* and *total_vaccinations***


In [None]:
df1.isnull().sum()

**Now we can get a clearer picture of the cumulative data**

In [None]:
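# Sum only the three cumulative columns we plot; df1 also holds text columns that cannot be summed
date_df = df1.groupby('date')[['people_vaccinated', 'people_fully_vaccinated', 'total_vaccinations']].sum()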
date_df.reset_index(inplace=True)

fig = go.Figure(data=[
    go.Scatter(
        x = date_df['date'],
        y = date_df['people_vaccinated'],
        name = 'Total People Vaccinated'
    ),
    go.Scatter(
        x = date_df['date'],
        y = date_df['people_fully_vaccinated'],
        name = 'People Fully Vaccinated'
    ),
    go.Scatter(
        x = date_df['date'],
        y = date_df['total_vaccinations'],
        name = 'Total Vaccinations Distributed'
    )

])

fig.update_layout(title = 'Vaccines Distributed and Administered',
                 xaxis_title = 'Date',
                 yaxis_title = 'Count')
fig.show()

# Vaccine Usage

**Now let's explore which countries are using which vaccines**

**The raw data contains one vaccine column. In this column, there could be more than one vaccine reported. We will need to separate this out.**

In [None]:
# Get the rows in df1 that have data in the vaccines column and only grab the country and vaccines columns
# Could have also grabbed the data from the original DataFrame, df. It will be the same

vaccine_df = df1[df1['vaccines'] != 0].reset_index().drop('index', axis=1)[['country', 'vaccines']]
vaccine_df

In [None]:
'''
Grab all the vaccines used in every row
'''

vaccine_list = []

# Iterates through all the rows in vaccine_df in the vaccines column. Split on ','
for i in vaccine_df.vaccines.str.split(','):
    
    # Take leading whitespace away and append to the vaccine_list
    for x in i:
        vaccine_list.append(str(x.lstrip()))

# Grab all the unique values out of vaccine_list
vaccines = list(set(vaccine_list))
vaccines
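
**A shorter way to get the same unique list might be to chain pandas string methods. This is a sketch on a made-up column (the `vaccines_col` Series is hypothetical), using `.str.split()`, `.explode()` and `.str.strip()`.**

```python
import pandas as pd

# Hypothetical toy column in the same comma-separated format as 'vaccines'
vaccines_col = pd.Series([
    'Moderna, Pfizer/BioNTech',
    'Pfizer/BioNTech',
    'Oxford/AstraZeneca, Pfizer/BioNTech',
])

# Split each entry into a list, give every vaccine its own row,
# trim the surrounding whitespace and keep each name once
unique_vaccines = sorted(vaccines_col.str.split(',').explode().str.strip().unique())
print(unique_vaccines)
```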

In [None]:
# Create a new, empty DataFrame with one column per vaccine found above
vaccine = pd.DataFrame(columns=vaccines)
# Attach these columns to vaccine_df to create the df2 DataFrame
df2 = pd.concat((vaccine_df, vaccine))
df2.fillna(0,inplace=True)
df2

In [None]:
'''
This might be a little tricky to understand and there might be a better way to do this. Please feel free to share
if you have any suggestions

We want to fill in the newly added vaccine columns with a 1 to represent if a certain country used that vaccine
'''

# Iterate through all the index positions in df2
for x in range(0, len(df2)):
    
    # Grab the 'vaccines' column in df2 and split it on ','
    vaccines = df2.iloc[x]['vaccines'].split(',')
    
    # Iterate through each vaccine found in the 'vaccines' column after splitting on ','
    for vaccine in vaccines:
        
        # Remove surrounding whitespace so the name matches a column label
        name = vaccine.strip()
        
        # If the name is one of the newly added vaccine columns in df2, put a 1 at that
        # index position (x). Exact matching is used so that a name which is only part
        # of a longer column name cannot be flagged by accident
        if name in df2.columns[2:]:
            df2.at[x, name] = 1

df2
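
**One possibly simpler way to build these 0/1 columns is the pandas method `.str.get_dummies()`, which splits a string column on a separator and returns one indicator column per value. The sketch below uses made-up data (the `toy` DataFrame and the names `flags` and `toy_flags` are hypothetical) and first removes the whitespace around the commas so the column names come out clean.**

```python
import pandas as pd

# Hypothetical toy data in the same format as vaccine_df
toy = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'vaccines': [
        'Moderna, Pfizer/BioNTech',
        'Pfizer/BioNTech',
        'Oxford/AstraZeneca, Pfizer/BioNTech',
    ],
})

# Normalise ", " to "," and then build one 0/1 column per vaccine
flags = (
    toy['vaccines']
    .str.replace(r'\s*,\s*', ',', regex=True)
    .str.get_dummies(sep=',')
)

# Attach the indicator columns next to the original ones
toy_flags = pd.concat([toy, flags], axis=1)
print(toy_flags)
```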

In [None]:
# Grab the top 10 countries that have the most fully vaccinated people for the most recent date in the data (df['date'].max())
top_fullyvaccinated = df1[df1['date'] == df['date'].max()].sort_values(['people_fully_vaccinated'], ascending=False).head(10).country.tolist()
top_fullyvaccinated
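
**The sort-then-head pattern can also be written with `.nlargest()`. Here is a minimal sketch on made-up numbers (the `toy` DataFrame and the names `latest` and `top2` are hypothetical). It assumes the column is stored as a numeric dtype.**

```python
import pandas as pd

# Hypothetical toy data: three countries over two days
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'A', 'B', 'C'],
    'date': ['2021-01-01'] * 3 + ['2021-01-02'] * 3,
    'people_fully_vaccinated': [10, 40, 20, 30, 50, 25],
})

# Keep only the most recent date, then take the n rows with the largest values
latest = toy[toy['date'] == toy['date'].max()]
top2 = latest.nlargest(2, 'people_fully_vaccinated')['country'].tolist()
print(top2)
```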

**Now let's take a look at the top 10 countries with people that are fully vaccinated**

**Here we can see that, because of the way the data is imported, a vaccine that a country used on at least one day is reported on every day for which there is data. This must be the way the data is scraped. Regardless, this will give us insight into the variety and popularity of certain vaccines**

In [None]:
df2.groupby('country').sum().loc[top_fullyvaccinated].reset_index()

**Let's look at the three most widely used and well-known vaccines for the countries in the top_fullyvaccinated list. These are Pfizer/BioNTech, Oxford/AstraZeneca and Moderna**

In [None]:
data1 = df2.groupby('country').sum().loc[top_fullyvaccinated].reset_index()

fig = go.Figure(data=[
    go.Bar(
        x = data1['country'],
        y = data1['Pfizer/BioNTech'],
        name = 'Pfizer/BioNTech'
    ),
    go.Bar(
        x = data1['country'],
        y = data1['Oxford/AstraZeneca'],
        name = 'Oxford/AstraZeneca'
    ),
    go.Bar(
        x = data1['country'],
        y = data1['Moderna'],
        name = 'Moderna'
    )
    
])

fig.update_layout(title = 'Top Vaccine Usage per Country',
                 xaxis_title = 'Country',
                 yaxis_title = 'Count')

fig.show()

**We can look at all the vaccines used in every country with the plot below. It is messy, but it gives us a good overall picture of which vaccines are being used.**

In [None]:
df2.groupby('country').sum().iplot(kind='bar')

**Thank you for looking at my code. Like I said, if you have any comments, please share!**

**I will continue working on this notebook and upload my findings!**