
## Encoding Data  
When we encode data, we recode a numeric column in a dataframe onto a binary scale (0 or 1).   
If we had a column of BMI scores, we could encode it so that every score greater than or equal to 25 is recoded to 1 (bad) and every score below 25 is recoded to 0 (good).  

For example:  

`def encode_bmi(df):`  
> `if df['bmi'] >= 25:`  
> > `return 1`  

> `else:`  
> > `return 0`  

`df["bmi"] = df.apply(encode_bmi, axis=1)`    
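The same encoding can also be written without a row-by-row function. The following is a minimal sketch on made-up BMI values (the frame `bmi_df` and the column name `bmi_encoded` are illustrative, not part of the worksheet data): comparing the column with the threshold gives a boolean Series, and converting it to integers yields the 0/1 codes.

```python
import pandas as pd

# Made-up BMI values, for illustration only
bmi_df = pd.DataFrame({"bmi": [18.5, 24.9, 25.0, 31.2]})

# The comparison yields True/False per row; astype(int) maps True -> 1 and False -> 0
bmi_df["bmi_encoded"] = (bmi_df["bmi"] >= 25).astype(int)

print(bmi_df)
```

Writing the result to a new column keeps the original scores available for checking the encoding.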

### Challenge 1 - prepare dataset for encoding 
---
1. Read Covid vaccination data from the `by_country` sheet in the Excel file at this link : https://github.com/lilaceri/Working-with-data-/blob/342abab10d93c4bf23b5c55a50f189f12a137c5f/Data%20Sets%20for%20code%20divisio/Covid%20Vaccination%20Data.xlsx?raw=true
2. Find out which columns have missing values
3. Remove all rows with missing data in the total_vaccinations column  
4. Remove all rows with missing data in the daily_vaccinations_per_million column  
5. Find the median daily_vaccinations_per_million, storing it in a variable for later use  


**Test output**:  
1. dataframe is saved in a variable
2. 
```
RangeIndex: 14994 entries, 0 to 14993
Data columns (total 15 columns):
 #   Column                               Non-Null Count  Dtype         
---  ------                               --------------  -----         
 0   country                              14994 non-null  object        
 1   iso_code                             14994 non-null  object        
 2   date                                 14994 non-null  datetime64[ns]
 3   total_vaccinations                   9011 non-null   float64       
 4   people_vaccinated                    8370 non-null   float64       
 5   people_fully_vaccinated              6158 non-null   float64       
 6   daily_vaccinations_raw               7575 non-null   float64       
 7   daily_vaccinations                   14796 non-null  float64       
 8   total_vaccinations_per_hundred       9011 non-null   float64       
 9   people_vaccinated_per_hundred        8370 non-null   float64       
 10  people_fully_vaccinated_per_hundred  6158 non-null   float64       
 11  daily_vaccinations_per_million       14796 non-null  float64       
 12  vaccines                             14994 non-null  object        
 13  source_name                          14994 non-null  object        
 14  source_website                       14994 non-null  object        
dtypes: datetime64[ns](1), float64(9), object(5)
memory usage: 1.7+ MB
```
3. 9011 rows × 15 columns  
4. 8816 rows × 15 columns  
5. 1915.5  
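As a rough sketch of the steps listed above, here is the same sequence on a small made-up dataframe (`toy_df` and its values are assumptions for illustration, not the Covid data). It counts missing values per column, drops rows on one column at a time, and then takes the median of what remains.

```python
import numpy as np
import pandas as pd

# Made-up data with gaps in two columns
toy_df = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "total_vaccinations": [100.0, np.nan, 250.0, 300.0],
    "daily_vaccinations_per_million": [np.nan, 12.0, 30.0, 40.0],
})

# Count missing values in each column, then keep the names of columns that have any
missing_per_column = toy_df.isna().sum()
columns_with_missing = missing_per_column[missing_per_column > 0].index.tolist()

# Drop rows with a missing value in one named column at a time
step3 = toy_df.dropna(subset=["total_vaccinations"])
step4 = step3.dropna(subset=["daily_vaccinations_per_million"])

# Median of the remaining values, stored for later use
toy_median = step4["daily_vaccinations_per_million"].median()

print(columns_with_missing, len(step3), len(step4), toy_median)
```

`dropna(subset=[...])` only looks at the named column, so gaps in other columns do not cause a row to be removed.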



In [1]:
# Task 1 : retrieve the data
import pandas as pd

url = "https://github.com/lilaceri/Working-with-data-/blob/342abab10d93c4bf23b5c55a50f189f12a137c5f/Data%20Sets%20for%20code%20divisio/Covid%20Vaccination%20Data.xlsx?raw=true"
vaccination_df = pd.read_excel(url, sheet_name="by_country")
print('dataframe is saved in a variable : vaccination_df')

dataframe is saved in a variable : vaccination_df


In [2]:
# Task 2 : inspect the dataframe
vaccination_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14994 entries, 0 to 14993
Data columns (total 15 columns):
 #   Column                               Non-Null Count  Dtype         
---  ------                               --------------  -----         
 0   country                              14994 non-null  object        
 1   iso_code                             14994 non-null  object        
 2   date                                 14994 non-null  datetime64[ns]
 3   total_vaccinations                   9011 non-null   float64       
 4   people_vaccinated                    8370 non-null   float64       
 5   people_fully_vaccinated              6158 non-null   float64       
 6   daily_vaccinations_raw               7575 non-null   float64       
 7   daily_vaccinations                   14796 non-null  float64       
 8   total_vaccinations_per_hundred       9011 non-null   float64       
 9   people_vaccinated_per_hundred        8370 non-null   float64       
 10  people_ful

In [4]:
# Task 3 : Remove all rows with missing data in the total_vaccinations column
vaccination_df["total_vaccinations"].isnull().values.any()
cleaned_vaccination_df = vaccination_df.dropna(subset=["total_vaccinations"])
display(cleaned_vaccination_df)

# The code below checks that the cleaned dataframe has the expected number of non-null total_vaccinations values
actual = cleaned_vaccination_df['total_vaccinations'].count()
expected = 9011

if actual == expected:
  print("Test passed", actual)
else:
  print("Test failed","Should have got", expected, "got", actual)

Unnamed: 0,country,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million,vaccines,source_name,source_website
0,Afghanistan,AFG,2021-02-22,0.0,0.0,,,,0.00,0.00,,,Oxford/AstraZeneca,Government of Afghanistan,https://reliefweb.int/report/afghanistan/afgha...
6,Afghanistan,AFG,2021-02-28,8200.0,8200.0,,,1367.0,0.02,0.02,,35.0,Oxford/AstraZeneca,Government of Afghanistan,https://reliefweb.int/report/afghanistan/afgha...
22,Afghanistan,AFG,2021-03-16,54000.0,54000.0,,,2862.0,0.14,0.14,,74.0,Oxford/AstraZeneca,Government of Afghanistan,https://reliefweb.int/report/afghanistan/afgha...
44,Afghanistan,AFG,2021-04-07,120000.0,120000.0,,,3000.0,0.31,0.31,,77.0,Oxford/AstraZeneca,Government of Afghanistan,https://reliefweb.int/report/afghanistan/afgha...
59,Afghanistan,AFG,2021-04-22,240000.0,240000.0,,,8000.0,0.62,0.62,,206.0,Oxford/AstraZeneca,Government of Afghanistan,https://reliefweb.int/report/afghanistan/afgha...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14989,Zimbabwe,ZWE,2021-04-28,458013.0,388021.0,69992.0,24074.0,17860.0,3.08,2.61,0.47,1202.0,Sinopharm/Beijing,Ministry of Health,https://twitter.com/MoHCCZim/status/1388935941...
14990,Zimbabwe,ZWE,2021-04-29,477597.0,400771.0,76826.0,19584.0,17971.0,3.21,2.70,0.52,1209.0,Sinopharm/Beijing,Ministry of Health,https://twitter.com/MoHCCZim/status/1388935941...
14991,Zimbabwe,ZWE,2021-04-30,500342.0,414735.0,85607.0,22745.0,19194.0,3.37,2.79,0.58,1291.0,Sinopharm/Beijing,Ministry of Health,https://twitter.com/MoHCCZim/status/1388935941...
14992,Zimbabwe,ZWE,2021-05-01,520299.0,428135.0,92164.0,19957.0,21171.0,3.50,2.88,0.62,1424.0,Sinopharm/Beijing,Ministry of Health,https://twitter.com/MoHCCZim/status/1388935941...


Test passed 9011


In [5]:
# Task 4 : Remove all rows with missing data in the daily_vaccinations_per_million column
removed_dvpm_cleaned_vaccination_df = cleaned_vaccination_df.dropna(subset=["daily_vaccinations_per_million"])
display(removed_dvpm_cleaned_vaccination_df)

# The code below checks that the dataframe has the expected number of non-null daily_vaccinations_per_million values
actual = removed_dvpm_cleaned_vaccination_df['daily_vaccinations_per_million'].count()
expected = 8816

if actual == expected:
  print("Test passed", actual)
else:
  print("Test failed","Should have got", expected, "got", actual)

Unnamed: 0,country,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million,vaccines,source_name,source_website
6,Afghanistan,AFG,2021-02-28,8200.0,8200.0,,,1367.0,0.02,0.02,,35.0,Oxford/AstraZeneca,Government of Afghanistan,https://reliefweb.int/report/afghanistan/afgha...
22,Afghanistan,AFG,2021-03-16,54000.0,54000.0,,,2862.0,0.14,0.14,,74.0,Oxford/AstraZeneca,Government of Afghanistan,https://reliefweb.int/report/afghanistan/afgha...
44,Afghanistan,AFG,2021-04-07,120000.0,120000.0,,,3000.0,0.31,0.31,,77.0,Oxford/AstraZeneca,Government of Afghanistan,https://reliefweb.int/report/afghanistan/afgha...
59,Afghanistan,AFG,2021-04-22,240000.0,240000.0,,,8000.0,0.62,0.62,,206.0,Oxford/AstraZeneca,Government of Afghanistan,https://reliefweb.int/report/afghanistan/afgha...
62,Albania,ALB,2021-01-12,128.0,128.0,,,64.0,0.00,0.00,,22.0,"Oxford/AstraZeneca, Pfizer/BioNTech, Sinovac, ...",Ministry of Health,https://twitter.com/gmanastirliu/status/138856...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14989,Zimbabwe,ZWE,2021-04-28,458013.0,388021.0,69992.0,24074.0,17860.0,3.08,2.61,0.47,1202.0,Sinopharm/Beijing,Ministry of Health,https://twitter.com/MoHCCZim/status/1388935941...
14990,Zimbabwe,ZWE,2021-04-29,477597.0,400771.0,76826.0,19584.0,17971.0,3.21,2.70,0.52,1209.0,Sinopharm/Beijing,Ministry of Health,https://twitter.com/MoHCCZim/status/1388935941...
14991,Zimbabwe,ZWE,2021-04-30,500342.0,414735.0,85607.0,22745.0,19194.0,3.37,2.79,0.58,1291.0,Sinopharm/Beijing,Ministry of Health,https://twitter.com/MoHCCZim/status/1388935941...
14992,Zimbabwe,ZWE,2021-05-01,520299.0,428135.0,92164.0,19957.0,21171.0,3.50,2.88,0.62,1424.0,Sinopharm/Beijing,Ministry of Health,https://twitter.com/MoHCCZim/status/1388935941...


Test passed 8816


In [7]:
# Task 5 : Find the median daily_vaccinations_per_million, storing it in a variable for later use
# get_median() calculates the median of the values in one column
def get_median(df, col):
  themedian = df[col].median()
  return themedian

display(removed_dvpm_cleaned_vaccination_df.describe())
median_dailyvaccine_permillion = get_median(removed_dvpm_cleaned_vaccination_df, "daily_vaccinations_per_million")
print("Median = ",median_dailyvaccine_permillion)

# The code below checks that the stored median matches the expected value
actual = median_dailyvaccine_permillion
expected = 1915.5

if actual == expected:
  print("Test passed", actual)
else:
  print("Test failed","Should have got", expected, "got", actual)



Unnamed: 0,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million
count,8816.0,8126.0,6143.0,7575.0,8816.0,8816.0,8126.0,6143.0,8816.0
mean,5073578.0,3255572.0,1583705.0,134756.9,114502.3,15.783548,11.432801,5.667215,3352.155626
std,20544830.0,11973020.0,6848034.0,521191.2,441402.7,23.385756,15.392827,10.062239,4980.979866
min,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,65871.5,56041.5,23734.5,2888.5,2693.5,1.48,1.36,0.63,658.0
50%,455054.5,349267.5,174780.0,15583.0,13677.5,6.65,5.05,2.35,1915.5
75%,1996336.0,1412589.0,689798.0,62498.5,52365.25,20.155,14.89,6.315,4388.0
max,275338000.0,147047000.0,104774700.0,11601000.0,7205286.0,211.08,111.32,99.76,118759.0


Median =  1915.5
Test passed 1915.5


### Challenge 2 - encode daily vaccinations 
---

Write a function to encode daily vaccinations per million, so that values greater than or equal to the median are coded as 1 and values less than the median are coded as 0.

**Test output**: 

using describe()
```
count    8816.000000
mean        0.500000
std         0.500028
min         0.000000
25%         0.000000
50%         0.500000
75%         1.000000
max         1.000000
Name: daily_vaccinations_per_million, dtype: float64
```

In [8]:
# encode_daily_vpm() encodes one row: it returns 1 if the value in the column named by key is greater than or equal to the median, otherwise 0.
def encode_daily_vpm(df, **kwds):
  median = kwds['median']
  key = kwds['key']
  if df[key] >= median:
    return 1
  else: 
    return 0

removed_dvpm_cleaned_vaccination_df['encoded'] = removed_dvpm_cleaned_vaccination_df.apply(encode_daily_vpm,axis=1,key="daily_vaccinations_per_million",median=median_dailyvaccine_permillion)
display(removed_dvpm_cleaned_vaccination_df['encoded'].describe())


SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


count    8816.000000
mean        0.500000
std         0.500028
min         0.000000
25%         0.000000
50%         0.500000
75%         1.000000
max         1.000000
Name: encoded, dtype: float64
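The message printed before the `describe()` output is pandas' `SettingWithCopyWarning`. pandas raises it when a column is assigned on a dataframe that was derived from another one and it cannot tell whether the result is a view or a copy. One common way to avoid it, sketched here on made-up data (`raw_df` and `clean_df` are illustrative names), is to take an explicit `.copy()` when the filtered dataframe is created.

```python
import numpy as np
import pandas as pd

# Made-up values with one gap
raw_df = pd.DataFrame({"daily_vaccinations_per_million": [10.0, np.nan, 30.0, 50.0]})

# .copy() makes the filtered frame an independent object,
# so adding a column to it cannot be mistaken for editing raw_df
clean_df = raw_df.dropna(subset=["daily_vaccinations_per_million"]).copy()

toy_median = clean_df["daily_vaccinations_per_million"].median()
clean_df["encoded"] = (clean_df["daily_vaccinations_per_million"] >= toy_median).astype(int)

print(clean_df)
```

The original `raw_df` is left untouched, and the new column exists only on `clean_df`.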

### Challenge 3 - Encoding total vaccinations   
---
The United Kingdom has been praised for its fast vaccine rollout. 
1. Find the minimum total vaccinations for the United Kingdom 
2. Write a function to encode total_vaccinations column so that all values less than the UK's min are 0 and all values greater than or equal to the UK's min are coded as 1 
3. Display the unique countries that have at least one total_vaccinations value greater than or equal to the UK's minimum

**Test output**:

1. 1402432.0
2. `df['total_vaccinations'].describe()` should output:
```
count    9011.00000
mean        0.29808
std         0.45744
min         0.00000
25%         0.00000
50%         0.00000
75%         1.00000
max         1.00000
Name: total_vaccinations, dtype: float64
```
3. 
```
array(['Argentina', 'Australia', 'Austria', 'Azerbaijan', 'Bangladesh',
       'Belgium', 'Brazil', 'Cambodia', 'Canada', 'Chile', 'China',
       'Colombia', 'Czechia', 'Denmark', 'Dominican Republic', 'England',
       'Finland', 'France', 'Germany', 'Greece', 'Hong Kong', 'Hungary',
       'India', 'Indonesia', 'Ireland', 'Israel', 'Italy', 'Japan',
       'Kazakhstan', 'Malaysia', 'Mexico', 'Morocco', 'Nepal',
       'Netherlands', 'Norway', 'Pakistan', 'Peru', 'Philippines',
       'Poland', 'Portugal', 'Qatar', 'Romania', 'Russia', 'Saudi Arabia',
       'Scotland', 'Serbia', 'Singapore', 'Slovakia', 'South Korea',
       'Spain', 'Sweden', 'Switzerland', 'Thailand', 'Turkey',
       'United Arab Emirates', 'United Kingdom', 'United States',
       'Uruguay', 'Wales'], dtype=object)
```




In [10]:
# Task 1 : Find the minimum total vaccinations for the United Kingdom
mini_total_vaccine_uk = cleaned_vaccination_df[cleaned_vaccination_df['country'] == 'United Kingdom']['total_vaccinations'].min()
print("the minimum total vaccinations for the United Kingdom = ", mini_total_vaccine_uk)

# The code below checks that the UK minimum matches the expected value
actual = mini_total_vaccine_uk 
expected = 1402432.0
if actual == expected:
  print("Test passed", actual)
else:
  print("Test failed","Should have got", expected, "got", actual)


the minimum total vaccinations for the United Kingdom =  1402432.0
Test passed 1402432.0


In [11]:
# Task 2:Write a function to encode total_vaccinations column so that all values less than the UK's min are 0 
#        and all values greater than or equal to the UK's min are coded as 1
def encode_total_vaccinations(df,**kwds):
  minvalue = kwds['minvalue']
  key = kwds['key']
  if df[key] < minvalue:
    return 0
  else:
    return 1


cleaned_vaccination_df['encode_total'] = cleaned_vaccination_df.apply(func=encode_total_vaccinations, axis=1, key='total_vaccinations', minvalue=mini_total_vaccine_uk)
display(cleaned_vaccination_df['encode_total'].describe())

SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


count    9011.00000
mean        0.29808
std         0.45744
min         0.00000
25%         0.00000
50%         0.00000
75%         1.00000
max         1.00000
Name: encode_total, dtype: float64
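The encoding functions in this worksheet all follow the same pattern: compare a column with a threshold and return 1 or 0. A minimal sketch of one reusable helper is shown below; `encode_at_least` is a hypothetical name and the values are made up, so treat it as an illustration rather than the worksheet's intended solution.

```python
import pandas as pd


def encode_at_least(values: pd.Series, threshold: float) -> pd.Series:
    """Return 1 where values >= threshold and 0 elsewhere (hypothetical helper)."""
    return (values >= threshold).astype(int)


# Made-up totals, for illustration only
toy_totals = pd.Series([5.0, 20.0, 20.5, 100.0], name="total_vaccinations")
encoded = encode_at_least(toy_totals, threshold=20.0)

print(encoded.tolist())
print(encoded.mean())
```

Because the codes are 0 and 1, the mean of an encoded column is the share of rows coded 1, which is how the `mean` line in the `describe()` output can be read.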

In [12]:
# Task 3 : Display the unique countries that have at least one total_vaccinations value greater than or equal to the UK's minimum
# keep only the rows that were encoded as 1
result_df = cleaned_vaccination_df[cleaned_vaccination_df['encode_total'] == 1]
print(result_df['country'].unique())
print("Number of countries = ", len(result_df['country'].unique()))

['Argentina' 'Australia' 'Austria' 'Azerbaijan' 'Bangladesh' 'Belgium'
 'Brazil' 'Cambodia' 'Canada' 'Chile' 'China' 'Colombia' 'Czechia'
 'Denmark' 'Dominican Republic' 'England' 'Finland' 'France' 'Germany'
 'Greece' 'Hong Kong' 'Hungary' 'India' 'Indonesia' 'Ireland' 'Israel'
 'Italy' 'Japan' 'Kazakhstan' 'Malaysia' 'Mexico' 'Morocco' 'Nepal'
 'Netherlands' 'Norway' 'Pakistan' 'Peru' 'Philippines' 'Poland'
 'Portugal' 'Qatar' 'Romania' 'Russia' 'Saudi Arabia' 'Scotland' 'Serbia'
 'Singapore' 'Slovakia' 'South Korea' 'Spain' 'Sweden' 'Switzerland'
 'Thailand' 'Turkey' 'United Arab Emirates' 'United Kingdom'
 'United States' 'Uruguay' 'Wales']
Number of countries =  59
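The list of countries can also be obtained straight from a boolean condition, without first storing an encoded column. The sketch below uses a made-up dataframe (the country names "Smallland" and "Bigland" and all values are invented for illustration).

```python
import pandas as pd

# Made-up data: two UK rows and two other countries
toy_df = pd.DataFrame({
    "country": ["United Kingdom", "United Kingdom", "Smallland", "Bigland", "Bigland"],
    "total_vaccinations": [1000.0, 5000.0, 400.0, 900.0, 7000.0],
})

# Smallest total recorded for the UK
uk_min = toy_df.loc[toy_df["country"] == "United Kingdom", "total_vaccinations"].min()

# Select the country column for rows at or above that value, then de-duplicate
at_least_uk = toy_df.loc[toy_df["total_vaccinations"] >= uk_min, "country"].unique()

print(list(at_least_uk))
```

A country appears in the result as soon as any one of its rows reaches the threshold, so "Bigland" is listed even though its first row is below the UK minimum.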


### Challenge 4 - create new series of total vaccinations for each manufacturer
---

To create a new column in your dataframe:

`df['new_column'] = ...`

For example:

* to duplicate an existing column
  * `df['new_column'] = df['old_column']`
* to add two columns together 
  * `df['new_column'] = df['column1'] + df['column2']`
* to make a percentages column 
  * `df['new_column'] = (df['column1']/df['column1'].sum()) * 100`
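The three patterns above can be tried on a small made-up dataframe. This is only a sketch: `toy_df` and the new column names are illustrative.

```python
import pandas as pd

# Made-up numbers, for illustration only
toy_df = pd.DataFrame({"column1": [25, 25, 50], "column2": [1, 2, 3]})

# Duplicate an existing column
toy_df["copy_of_column1"] = toy_df["column1"]

# Add two columns together, row by row
toy_df["column_sum"] = toy_df["column1"] + toy_df["column2"]

# Each value as a percentage of the column total
toy_df["column1_percent"] = toy_df["column1"] / toy_df["column1"].sum() * 100

print(toy_df)
```

A percentages column built this way sums to 100 across all rows.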

  
1. read data from 'by_manufacturer' sheet from Covid data 
2. find the sum of total vaccinations for each manufacturer
3. create a new column that has the total vaccinations as a percentage of the overall sum of total vaccinations 
4. find the median percentage 
5. create a new column called 'encoded_percentages' which duplicates the percentages column
6. encode the encoded_percentages column so that values greater than or equal to the median percentage = 1 and values less than the median percentage = 0 


**Test output**:

1.
2. 
```
vaccine
Johnson&Johnson        264839828
Moderna               5548036383
Oxford/AstraZeneca     539433203
Pfizer/BioNTech       8690461304
Sinovac                604660293
Name: total_vaccinations, dtype: int64
```
3. 
```
	location	date	vaccine	total_vaccinations	percentages
0	Chile	2020-12-24	Pfizer/BioNTech	420	0.000003
1	Chile	2020-12-25	Pfizer/BioNTech	5198	0.000033
2	Chile	2020-12-26	Pfizer/BioNTech	8338	0.000053
3	Chile	2020-12-27	Pfizer/BioNTech	8649	0.000055
4	Chile	2020-12-28	Pfizer/BioNTech	8649	0.000055
...	...	...	...	...	...
3291	United States	2021-05-01	Moderna	105947940	0.677095
3292	United States	2021-05-01	Pfizer/BioNTech	129013657	0.824504
3293	United States	2021-05-02	Johnson&Johnson	8374395	0.053519
3294	United States	2021-05-02	Moderna	106780082	0.682413
3295	United States	2021-05-02	Pfizer/BioNTech	130252779	0.832423
3296 rows × 5 columns
```
4. 0.0011110194374896931
5. 
```
location	date	vaccine	total_vaccinations	percentages	encoded_percentages
0	Chile	2020-12-24	Pfizer/BioNTech	420	0.000003	0.000003
1	Chile	2020-12-25	Pfizer/BioNTech	5198	0.000033	0.000033
2	Chile	2020-12-26	Pfizer/BioNTech	8338	0.000053	0.000053
3	Chile	2020-12-27	Pfizer/BioNTech	8649	0.000055	0.000055
4	Chile	2020-12-28	Pfizer/BioNTech	8649	0.000055	0.000055
...	...	...	...	...	...	...
3291	United States	2021-05-01	Moderna	105947940	0.677095	0.677095
3292	United States	2021-05-01	Pfizer/BioNTech	129013657	0.824504	0.824504
3293	United States	2021-05-02	Johnson&Johnson	8374395	0.053519	0.053519
3294	United States	2021-05-02	Moderna	106780082	0.682413	0.682413
3295	United States	2021-05-02	Pfizer/BioNTech	130252779	0.832423	0.832423
3296 rows × 6 columns
```
6. 
```
	location	date	vaccine	total_vaccinations	percentages	encode	encoded
0	Chile	2020-12-24	Pfizer/BioNTech	420	0.000003	0.000003	0
1	Chile	2020-12-25	Pfizer/BioNTech	5198	0.000033	0.000033	0
2	Chile	2020-12-26	Pfizer/BioNTech	8338	0.000053	0.000053	0
3	Chile	2020-12-27	Pfizer/BioNTech	8649	0.000055	0.000055	0
4	Chile	2020-12-28	Pfizer/BioNTech	8649	0.000055	0.000055	0
...	...	...	...	...	...	...	...
3291	United States	2021-05-01	Moderna	105947940	0.677095	0.677095	1
3292	United States	2021-05-01	Pfizer/BioNTech	129013657	0.824504	0.824504	1
3293	United States	2021-05-02	Johnson&Johnson	8374395	0.053519	0.053519	1
3294	United States	2021-05-02	Moderna	106780082	0.682413	0.682413	1
3295	United States	2021-05-02	Pfizer/BioNTech	130252779	0.832423	0.832423	1
3296 rows × 7 columns
```



In [14]:
# read data from 'by_manufacturer' sheet from Covid data
manu_df = pd.read_excel(url, sheet_name='by_manufacturer')
print(manu_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3296 entries, 0 to 3295
Data columns (total 4 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   location            3296 non-null   object        
 1   date                3296 non-null   datetime64[ns]
 2   vaccine             3296 non-null   object        
 3   total_vaccinations  3296 non-null   int64         
dtypes: datetime64[ns](1), int64(1), object(2)
memory usage: 103.1+ KB
None


In [15]:
# find the sum of total vaccinations for each manufacturer
total_eachmanu = manu_df.groupby('vaccine')['total_vaccinations'].sum() 
print(total_eachmanu)

vaccine
Johnson&Johnson        264839828
Moderna               5548036383
Oxford/AstraZeneca     539433203
Pfizer/BioNTech       8690461304
Sinovac                604660293
Name: total_vaccinations, dtype: int64
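One caution when reading these sums: the Chile rows shown in this worksheet (420, 5198, 8338, 8649, 8649) suggest that `total_vaccinations` is a running total per location and vaccine, although the source does not state this. If that assumption holds, summing every row adds each day's running total together. The sketch below, on made-up data, contrasts that with summing only the most recent row for each location and vaccine.

```python
import pandas as pd

# Made-up running totals for two vaccines in one location
toy_manu = pd.DataFrame({
    "location": ["Chile", "Chile", "Chile", "Chile"],
    "date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-01", "2021-01-02"]),
    "vaccine": ["Pfizer/BioNTech", "Pfizer/BioNTech", "Sinovac", "Sinovac"],
    "total_vaccinations": [100, 150, 40, 90],
})

# Summing every row adds the day-1 and day-2 running totals together
summed_rows = toy_manu.groupby("vaccine")["total_vaccinations"].sum()

# Alternative: keep the latest row per location and vaccine, then sum per vaccine
latest = toy_manu.sort_values("date").groupby(["location", "vaccine"]).tail(1)
latest_totals = latest.groupby("vaccine")["total_vaccinations"].sum()

print(summed_rows)
print(latest_totals)
```

Which of the two is appropriate depends on the question being asked; the worksheet's expected output corresponds to summing every row.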


In [16]:
# create a new column that has the total vaccinations as a percentage of the overall sum of total vaccinations
manu_df['percentages'] = manu_df['total_vaccinations']  / manu_df['total_vaccinations'].sum() * 100
display(manu_df)

Unnamed: 0,location,date,vaccine,total_vaccinations,percentages
0,Chile,2020-12-24,Pfizer/BioNTech,420,0.000003
1,Chile,2020-12-25,Pfizer/BioNTech,5198,0.000033
2,Chile,2020-12-26,Pfizer/BioNTech,8338,0.000053
3,Chile,2020-12-27,Pfizer/BioNTech,8649,0.000055
4,Chile,2020-12-28,Pfizer/BioNTech,8649,0.000055
...,...,...,...,...,...
3291,United States,2021-05-01,Moderna,105947940,0.677095
3292,United States,2021-05-01,Pfizer/BioNTech,129013657,0.824504
3293,United States,2021-05-02,Johnson&Johnson,8374395,0.053519
3294,United States,2021-05-02,Moderna,106780082,0.682413


In [18]:
# find the median percentage
median_percent = manu_df['percentages'].median()
print("median percentage = ", median_percent)

# The code below will run and test your code to see if you have a correct median percentage
actual = median_percent 
expected = 0.0011110194374896931
if actual == expected:
  print("Test passed", actual)
else:
  print("Test failed","Should have got", expected, "got", actual)

median percentage =  0.0011110194374896931
Test passed 0.0011110194374896931
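The test above compares two floating-point numbers with `==`, which only passes when they match to the last bit. A tolerance-based comparison is usually more robust. The sketch below uses the standard library's `math.isclose` with a well-known example of rounding noise; the numbers are illustrative and unrelated to the vaccination data.

```python
import math

expected = 0.3
actual = 0.1 + 0.2  # differs from 0.3 by floating-point rounding noise

# Exact comparison fails because of the tiny rounding difference
exact_match = actual == expected

# Comparison within a relative tolerance succeeds
close_match = math.isclose(actual, expected, rel_tol=1e-9)

print(exact_match, close_match)
```

The same check could replace `actual == expected` in the test cells whenever the expected value is a float that results from arithmetic.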


In [19]:
# create a new column called 'encoded_percentages' which duplicates the percentages column
manu_df['encoded_percentages'] = manu_df['percentages']
display(manu_df)

Unnamed: 0,location,date,vaccine,total_vaccinations,percentages,encoded_percentages
0,Chile,2020-12-24,Pfizer/BioNTech,420,0.000003,0.000003
1,Chile,2020-12-25,Pfizer/BioNTech,5198,0.000033,0.000033
2,Chile,2020-12-26,Pfizer/BioNTech,8338,0.000053,0.000053
3,Chile,2020-12-27,Pfizer/BioNTech,8649,0.000055,0.000055
4,Chile,2020-12-28,Pfizer/BioNTech,8649,0.000055,0.000055
...,...,...,...,...,...,...
3291,United States,2021-05-01,Moderna,105947940,0.677095,0.677095
3292,United States,2021-05-01,Pfizer/BioNTech,129013657,0.824504,0.824504
3293,United States,2021-05-02,Johnson&Johnson,8374395,0.053519,0.053519
3294,United States,2021-05-02,Moderna,106780082,0.682413,0.682413


In [20]:
# encode the encoded_percentages column so that values greater than or equal to the median percentage = 1 and values less than the median percentage = 0
def encoded_percent(df, **kwds):
  key = kwds['key']
  medianpercent = kwds['medianpercent']
  if df[key] >= medianpercent:
    return 1
  else: 
    return 0
    
manu_df['encoded'] = manu_df.apply(encoded_percent, axis=1, key="encoded_percentages", medianpercent=median_percent)
display(manu_df)

Unnamed: 0,location,date,vaccine,total_vaccinations,percentages,encoded_percentages,encoded
0,Chile,2020-12-24,Pfizer/BioNTech,420,0.000003,0.000003,0
1,Chile,2020-12-25,Pfizer/BioNTech,5198,0.000033,0.000033,0
2,Chile,2020-12-26,Pfizer/BioNTech,8338,0.000053,0.000053,0
3,Chile,2020-12-27,Pfizer/BioNTech,8649,0.000055,0.000055,0
4,Chile,2020-12-28,Pfizer/BioNTech,8649,0.000055,0.000055,0
...,...,...,...,...,...,...,...
3291,United States,2021-05-01,Moderna,105947940,0.677095,0.677095,1
3292,United States,2021-05-01,Pfizer/BioNTech,129013657,0.824504,0.824504,1
3293,United States,2021-05-02,Johnson&Johnson,8374395,0.053519,0.053519,1
3294,United States,2021-05-02,Moderna,106780082,0.682413,0.682413,1


### Exercise 8 - encode daily vaccinations 

1. find the median daily vaccinations per 1 million 
2. write a function to encode daily vaccinations per 1 million, so that values greater than or equal to the median are coded as 1 and values less than the median are coded as 0 

Output: 

1. 1915.5
2. 
```
0        0
6        0
22       0
44       0
59       0
        ..
14989    0
14990    0
14991    0
14992    0
14993    0
Name: daily_vaccinations_per_million, Length: 9011, dtype: int64
```

In [21]:
# find the median daily vaccinations per 1 million
median_daily_vaccine = cleaned_vaccination_df['daily_vaccinations_per_million'].median()
print("median_daily_vaccine = ", median_daily_vaccine, "\n")

# The code below checks that the median daily vaccinations per million matches the expected value
actual = median_daily_vaccine 
expected = 1915.5

if actual == expected:
  print("Test passed", actual)
else:
  print("Test failed","Should have got", expected, "got", actual)

median_daily_vaccine =  1915.5 

Test passed 1915.5
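Here the median is taken from `cleaned_vaccination_df`, which still contains missing values in `daily_vaccinations_per_million`, yet the result matches the one computed after those rows were dropped. That is because `Series.median()` skips missing values by default. A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Made-up values with one gap
with_gaps = pd.Series([10.0, np.nan, 30.0, 50.0])

median_default = with_gaps.median()               # missing values are skipped by default
median_after_drop = with_gaps.dropna().median()   # same result after dropping them explicitly
median_keep_nan = with_gaps.median(skipna=False)  # NaN, because the gap is not skipped

print(median_default, median_after_drop, median_keep_nan)
```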


In [22]:
# write a function to encode daily vaccinations per 1 million, where values greater than or equal to median = 1 
# and values less than median = 0
def encode_daily_vpm2(df, **kwds):
  key = kwds['key']
  median = kwds['median_dv']
  if df[key] >= median:
    return 1
  else:
    return 0

cleaned_vaccination_df['encode_daily_vaccinations_per_million'] = cleaned_vaccination_df.apply(encode_daily_vpm2, axis=1, key="daily_vaccinations_per_million", median_dv=median_daily_vaccine)
cleaned_vaccination_df['encode_daily_vaccinations_per_million'] 

SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


0        0
6        0
22       0
44       0
59       0
        ..
14989    0
14990    0
14991    0
14992    0
14993    0
Name: encode_daily_vaccinations_per_million, Length: 9011, dtype: int64
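Note that this encoding runs on a dataframe that still has missing values in `daily_vaccinations_per_million`. Any comparison with a missing value is False, so those rows are coded 0 just like genuinely low values. The sketch below, on made-up values, shows the effect and one possible way to keep missing values as missing; whether that is preferable depends on the analysis.

```python
import numpy as np
import pandas as pd

# Made-up values with one gap
values = pd.Series([10.0, np.nan, 30.0, 50.0])
threshold = values.median()  # the gap is skipped when computing the median

# NaN >= threshold evaluates to False, so the missing row becomes 0
plain = (values >= threshold).astype(int)

# Alternative: blank out the code wherever the original value was missing
keep_missing = plain.where(values.notna())

print(plain.tolist())
print(keep_missing.tolist())
```

With `where`, the encoded column becomes a float column because it has to hold NaN alongside 0 and 1.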

### Exercise 9 - Encoding vaccinations per hundred  
---
The United Kingdom has been praised for its fast vaccine rollout. 
1. find the minimum total vaccinations for the United Kingdom 
2. save this value in a variable rounded down to an integer
3. write a function to encode total_vaccinations column so that all values less than the UK's min are 0 and all values greater than or equal to the UK's min are coded as 1 
4. display the countries that have at least one total_vaccinations value greater than or equal to the UK's minimum

Output:

1. 1402432.0
2. 1402432
3. `df['total_vaccinations']` should output:
```
0        0
6        0
22       0
44       0
59       0
        ..
14989    0
14990    0
14991    0
14992    0
14993    0
Name: total_vaccinations, Length: 9011, dtype: int64
```
4. 
```
array(['Argentina', 'Australia', 'Austria', 'Azerbaijan', 'Bangladesh',
       'Belgium', 'Brazil', 'Cambodia', 'Canada', 'Chile', 'China',
       'Colombia', 'Czechia', 'Denmark', 'Dominican Republic', 'England',
       'Finland', 'France', 'Germany', 'Greece', 'Hong Kong', 'Hungary',
       'India', 'Indonesia', 'Ireland', 'Israel', 'Italy', 'Japan',
       'Kazakhstan', 'Malaysia', 'Mexico', 'Morocco', 'Nepal',
       'Netherlands', 'Norway', 'Pakistan', 'Peru', 'Philippines',
       'Poland', 'Portugal', 'Qatar', 'Romania', 'Russia', 'Saudi Arabia',
       'Scotland', 'Serbia', 'Singapore', 'Slovakia', 'South Korea',
       'Spain', 'Sweden', 'Switzerland', 'Thailand', 'Turkey',
       'United Arab Emirates', 'United Kingdom', 'United States',
       'Uruguay', 'Wales'], dtype=object)
```




In [23]:
# find the minimum total vaccinations for the United Kingdom
min_total_vaccine = cleaned_vaccination_df[cleaned_vaccination_df['country'] == 'United Kingdom']['total_vaccinations'].min()
print("minimum total vaccinations UK = ",min_total_vaccine)

minimum total vaccinations UK =  1402432.0


In [24]:
# save this value in a variable rounded down to an integer
min_total_vaccine_int = int(min_total_vaccine)
print("minimum total vaccinations UK integer = ", min_total_vaccine_int)

minimum total vaccinations UK integer =  1402432
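`int()` drops the fractional part, which for a non-negative number is the same as rounding down. For negative numbers the two differ, and `math.floor` is the function that always rounds down. A short sketch (the negative value is made up purely to show the difference):

```python
import math

uk_min = 1402432.0
truncated = int(uk_min)      # drops the fractional part
floored = math.floor(uk_min) # rounds down; same result for non-negative values

negative_example = -2.5
truncated_negative = int(negative_example)      # truncates toward zero
floored_negative = math.floor(negative_example) # rounds down, away from zero here

print(truncated, floored, truncated_negative, floored_negative)
```

Vaccination counts are never negative, so either function gives the same integer here.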


In [25]:
# write a function to encode total_vaccinations column so that all values less than the UK's min are 0 and all values greater than or equal to the UK's min are coded as 1
def encode_total_vaccine(df, **kwds):
  key = kwds['key']
  min_value = kwds['min']  # local name chosen so the built-in min() is not shadowed
  if df[key] >= min_value:
    return 1
  else:
    return 0

cleaned_vaccination_df['encode_total_lessmin'] = cleaned_vaccination_df.apply(encode_total_vaccine, axis=1, key="total_vaccinations", min=min_total_vaccine_int)
cleaned_vaccination_df['encode_total_lessmin']

SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


0        0
6        0
22       0
44       0
59       0
        ..
14989    0
14990    0
14991    0
14992    0
14993    0
Name: encode_total_lessmin, Length: 9011, dtype: int64

In [26]:
cleaned_vaccination_df['people_vaccinated_per_hundred']

0        0.00
6        0.02
22       0.14
44       0.31
59       0.62
         ... 
14989    2.61
14990    2.70
14991    2.79
14992    2.88
14993    2.89
Name: people_vaccinated_per_hundred, Length: 9011, dtype: float64

In [27]:
# display the countries that have at least one total_vaccinations value greater than or equal to the UK's minimum
theresult = cleaned_vaccination_df[cleaned_vaccination_df['encode_total_lessmin']==1]['country'].unique()
print(theresult)
print("Number of countries = ", len(theresult))

['Argentina' 'Australia' 'Austria' 'Azerbaijan' 'Bangladesh' 'Belgium'
 'Brazil' 'Cambodia' 'Canada' 'Chile' 'China' 'Colombia' 'Czechia'
 'Denmark' 'Dominican Republic' 'England' 'Finland' 'France' 'Germany'
 'Greece' 'Hong Kong' 'Hungary' 'India' 'Indonesia' 'Ireland' 'Israel'
 'Italy' 'Japan' 'Kazakhstan' 'Malaysia' 'Mexico' 'Morocco' 'Nepal'
 'Netherlands' 'Norway' 'Pakistan' 'Peru' 'Philippines' 'Poland'
 'Portugal' 'Qatar' 'Romania' 'Russia' 'Saudi Arabia' 'Scotland' 'Serbia'
 'Singapore' 'Slovakia' 'South Korea' 'Spain' 'Sweden' 'Switzerland'
 'Thailand' 'Turkey' 'United Arab Emirates' 'United Kingdom'
 'United States' 'Uruguay' 'Wales']
Number of countries =  59


### Exercise 10 - create new series of total vaccinations percentages
---

To create a new column in your dataframe:

`df['new_column'] = ...`

For example:

* to duplicate an existing column
  * `df['new_column'] = df['old_column']`
* to add two columns together 
  * `df['new_column'] = df['column1'] + df['column2']`
* to make a percentages column 
  * `df['new_column'] = (df['column1']/df['column1'].sum()) * 100`  
  


1. read data from 'by_manufacturer' sheet from Covid data 
2. find the sum of total vaccinations for each manufacturer
3. create a new column that has the total vaccinations as a percentage of the overall sum of total vaccinations 
4. find the median percentage 
5. create a new column called 'encoded_percentages' which duplicates the percentages column
6. encode the encoded_percentages column so that values greater than or equal to the median percentage = 1 and values less than the median percentage = 0 


Output:

1.
2. 
```
vaccine
Johnson&Johnson        264839828
Moderna               5548036383
Oxford/AstraZeneca     539433203
Pfizer/BioNTech       8690461304
Sinovac                604660293
Name: total_vaccinations, dtype: int64
```
3. 
```
	location	date	vaccine	total_vaccinations	percentages
0	Chile	2020-12-24	Pfizer/BioNTech	420	0.000003
1	Chile	2020-12-25	Pfizer/BioNTech	5198	0.000033
2	Chile	2020-12-26	Pfizer/BioNTech	8338	0.000053
3	Chile	2020-12-27	Pfizer/BioNTech	8649	0.000055
4	Chile	2020-12-28	Pfizer/BioNTech	8649	0.000055
...	...	...	...	...	...
3291	United States	2021-05-01	Moderna	105947940	0.677095
3292	United States	2021-05-01	Pfizer/BioNTech	129013657	0.824504
3293	United States	2021-05-02	Johnson&Johnson	8374395	0.053519
3294	United States	2021-05-02	Moderna	106780082	0.682413
3295	United States	2021-05-02	Pfizer/BioNTech	130252779	0.832423
3296 rows × 5 columns
```
4. 0.0011110194374896931
5. 
6. 
```
	location	date	vaccine	total_vaccinations	percentages	encode	encoded
0	Chile	2020-12-24	Pfizer/BioNTech	420	0.000003	0.000003	0
1	Chile	2020-12-25	Pfizer/BioNTech	5198	0.000033	0.000033	0
2	Chile	2020-12-26	Pfizer/BioNTech	8338	0.000053	0.000053	0
3	Chile	2020-12-27	Pfizer/BioNTech	8649	0.000055	0.000055	0
4	Chile	2020-12-28	Pfizer/BioNTech	8649	0.000055	0.000055	0
...	...	...	...	...	...	...	...
3291	United States	2021-05-01	Moderna	105947940	0.677095	0.677095	1
3292	United States	2021-05-01	Pfizer/BioNTech	129013657	0.824504	0.824504	1
3293	United States	2021-05-02	Johnson&Johnson	8374395	0.053519	0.053519	1
3294	United States	2021-05-02	Moderna	106780082	0.682413	0.682413	1
3295	United States	2021-05-02	Pfizer/BioNTech	130252779	0.832423	0.832423	1
3296 rows × 7 columns
```



In [None]:
# read data from 'by_manufacturer' sheet from Covid data
manu_df = pd.read_excel(url, sheet_name='by_manufacturer')
manu_df

Unnamed: 0,location,date,vaccine,total_vaccinations
0,Chile,2020-12-24,Pfizer/BioNTech,420
1,Chile,2020-12-25,Pfizer/BioNTech,5198
2,Chile,2020-12-26,Pfizer/BioNTech,8338
3,Chile,2020-12-27,Pfizer/BioNTech,8649
4,Chile,2020-12-28,Pfizer/BioNTech,8649
...,...,...,...,...
3291,United States,2021-05-01,Moderna,105947940
3292,United States,2021-05-01,Pfizer/BioNTech,129013657
3293,United States,2021-05-02,Johnson&Johnson,8374395
3294,United States,2021-05-02,Moderna,106780082


In [None]:
# find the sum of total vaccinations for each manufacturer
sum_totalvaccine = manu_df.groupby('vaccine')['total_vaccinations'].sum()
print(sum_totalvaccine)

vaccine
Johnson&Johnson        264839828
Moderna               5548036383
Oxford/AstraZeneca     539433203
Pfizer/BioNTech       8690461304
Sinovac                604660293
Name: total_vaccinations, dtype: int64
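
To see how `groupby()` followed by `sum()` arrives at one total per vaccine, here is a minimal sketch on made-up data. The vaccine names `A` and `B` and the counts are hypothetical; only the pattern matches the cell above.

```python
import pandas as pd

# Hypothetical toy data; the vaccine names and counts are made up for illustration
toy_df = pd.DataFrame({
    "vaccine": ["A", "B", "A", "B", "A"],
    "total_vaccinations": [10, 20, 30, 40, 50],
})

# Rows are split into groups by 'vaccine', then 'total_vaccinations'
# is summed within each group, giving a Series indexed by vaccine
totals = toy_df.groupby("vaccine")["total_vaccinations"].sum()

print(totals)
```

The result is a Series, so a single total can be looked up by label, for example `totals["A"]`.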


In [None]:
# create a new column that has the total vaccinations as a percentage of the overall sum of total vaccinations
manu_df['percentages'] = manu_df['total_vaccinations'] / manu_df['total_vaccinations'].sum() * 100
display(manu_df)

Unnamed: 0,location,date,vaccine,total_vaccinations,percentages
0,Chile,2020-12-24,Pfizer/BioNTech,420,0.000003
1,Chile,2020-12-25,Pfizer/BioNTech,5198,0.000033
2,Chile,2020-12-26,Pfizer/BioNTech,8338,0.000053
3,Chile,2020-12-27,Pfizer/BioNTech,8649,0.000055
4,Chile,2020-12-28,Pfizer/BioNTech,8649,0.000055
...,...,...,...,...,...
3291,United States,2021-05-01,Moderna,105947940,0.677095
3292,United States,2021-05-01,Pfizer/BioNTech,129013657,0.824504
3293,United States,2021-05-02,Johnson&Johnson,8374395,0.053519
3294,United States,2021-05-02,Moderna,106780082,0.682413


In [None]:
# find the median percentage
median_percent = manu_df['percentages'].median()
print("median_percent = ", median_percent)

median_percent =  0.0011110194374896931


In [None]:
# create a new column called 'encoded_percentages' which duplicates the percentages column
manu_df['encoded_percentages'] = manu_df['percentages']
display(manu_df)

Unnamed: 0,location,date,vaccine,total_vaccinations,percentages,encoded_percentages
0,Chile,2020-12-24,Pfizer/BioNTech,420,0.000003,0.000003
1,Chile,2020-12-25,Pfizer/BioNTech,5198,0.000033,0.000033
2,Chile,2020-12-26,Pfizer/BioNTech,8338,0.000053,0.000053
3,Chile,2020-12-27,Pfizer/BioNTech,8649,0.000055,0.000055
4,Chile,2020-12-28,Pfizer/BioNTech,8649,0.000055,0.000055
...,...,...,...,...,...,...
3291,United States,2021-05-01,Moderna,105947940,0.677095,0.677095
3292,United States,2021-05-01,Pfizer/BioNTech,129013657,0.824504,0.824504
3293,United States,2021-05-02,Johnson&Johnson,8374395,0.053519,0.053519
3294,United States,2021-05-02,Moderna,106780082,0.682413,0.682413


In [None]:
# encode the encoded_percentages column so that any values greater than or equal to the median percentage = 1 and any lesser than = 0
def encode_percentages(row, **kwds):
  # key: name of the column to test; median: threshold passed in through apply()
  key = kwds['key']
  median = kwds['median']
  if row[key] >= median:
    return 1
  else:
    return 0

manu_df['encoded'] = manu_df.apply(encode_percentages, axis=1, key='encoded_percentages', median=median_percent)
display(manu_df)

Unnamed: 0,location,date,vaccine,total_vaccinations,percentages,encoded_percentages,encoded
0,Chile,2020-12-24,Pfizer/BioNTech,420,0.000003,0.000003,0
1,Chile,2020-12-25,Pfizer/BioNTech,5198,0.000033,0.000033,0
2,Chile,2020-12-26,Pfizer/BioNTech,8338,0.000053,0.000053,0
3,Chile,2020-12-27,Pfizer/BioNTech,8649,0.000055,0.000055,0
4,Chile,2020-12-28,Pfizer/BioNTech,8649,0.000055,0.000055,0
...,...,...,...,...,...,...,...
3291,United States,2021-05-01,Moderna,105947940,0.677095,0.677095,1
3292,United States,2021-05-01,Pfizer/BioNTech,129013657,0.824504,0.824504,1
3293,United States,2021-05-02,Johnson&Johnson,8374395,0.053519,0.053519,1
3294,United States,2021-05-02,Moderna,106780082,0.682413,0.682413,1
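
As an alternative to the row-by-row `apply()` used above, the same threshold encoding can be sketched with a single comparison over the whole column. This is a different technique from the one in the worksheet, shown on hypothetical values rather than the Covid data.

```python
import pandas as pd

# Hypothetical toy data standing in for the 'percentages' column
toy_df = pd.DataFrame({"percentages": [0.1, 0.4, 0.2, 0.9, 0.5]})

# The median of the column is used as the encoding threshold
threshold = toy_df["percentages"].median()

# Comparing the whole column at once gives True/False values;
# astype(int) converts them to 1/0
toy_df["encoded"] = (toy_df["percentages"] >= threshold).astype(int)

print(toy_df)
```

On larger dataframes this form usually runs faster than `apply(..., axis=1)`, because the comparison is done on the column as a whole instead of calling a Python function once per row.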


# Reflection
----

## What skills have you demonstrated in completing this notebook?

Your answer:

- Used pandas to read data from Excel or CSV files by passing in a URL.

- Passed parameters with `**kwds` to an encoding function to encode a column in the dataframe.

- Used the `apply()` method to update values in a column.

- Created a new column in a dataframe.

- Used the `groupby()` function to group by fields and calculate the sum, min and median.

## What caused you the most difficulty?

Your answer: 

- The trickiest part was using `**kwds` in the encode function.

- Using the `apply()` method to update values in a column.

- `groupby()` can be quite tricky as well.