<a href="https://colab.research.google.com/github/tgalbaugh/D2KC2/blob/main/Data_2_KC_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Welcome to Knowledge Check 2!

We're going to clean some data today. I would guess about 70% of data science is cleaning and wrangling data, so we want to make sure you can do it. I haven't looked through this data set at all, so we'll see if we can find some issues to clean up. Here's what you need to do to complete the knowledge check: 

1. Make a .py (or .ipynb) file, using any editor you like, that does the following:
- Find and access a data set in any way you want. You can use an API, a CSV, anything.
- Fix TWO issues with the data set using techniques you've learned in class. Here are some common fixes:
  - Remove null values
  - Fill in null values with 0s or blanks
  - Fill in blanks
  - Fix character strings that aren't formatted correctly (you could use regex for this)
  - Correct column names if they're misnamed
  - Correct spelling (for example, you might have a Country column with an entry that says "Unted States of America")
  - There are hundreds of other things you could fix depending on the issues, so don't worry about whether or not your fix "counts" for the check. It most likely will if you're fixing something.
2. Commit your changes.
3. Push your changes to GitHub, and make sure to submit the GitHub link in Google Classroom.
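
A few of the common fixes listed above can be sketched on a tiny, made-up DataFrame. The column names and values here are hypothetical and only illustrate the pandas calls; your own data set will need its own fixes.

```python
import pandas as pd

# Hypothetical sample data with a few typical problems.
raw = pd.DataFrame({
    "Country ": ["Unted States of America", "Canada", None],
    "Phone": ["(502) 555-0101", "502.555.0102", "502 555 0103"],
    "Visits": [3.0, None, 7.0],
})

# Correct a misnamed column (stray trailing space).
clean = raw.rename(columns={"Country ": "Country"})

# Correct a known misspelling.
clean["Country"] = clean["Country"].replace(
    {"Unted States of America": "United States of America"}
)

# Fill missing numeric values with 0.
clean["Visits"] = clean["Visits"].fillna(0)

# Use a regex to strip everything except digits from the phone numbers.
clean["Phone"] = clean["Phone"].str.replace(r"\D", "", regex=True)

# Remove rows where the country is still missing.
clean = clean.dropna(subset=["Country"]).reset_index(drop=True)

print(clean)
```

Any two of these steps, applied to real problems in your data, would satisfy the requirement.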


## Cleaning Data

Data taken from [Louisville Metro Data](https://data.louisvilleky.gov/dataset/parking-citations). This parking data set from 2018 looked interesting, and I haven't looked at it before, so let's get started. The interesting thing about data analysis is that you don't have to stick to raw science topics; you can also look at sociological data to answer questions about your community and area. For example, this data set is about parking citations. If a lot of citations were clustered in one geographic area, that might inform a decision to add more parking there.

In [61]:
import pandas as pd
import numpy as np 

In [62]:
# This is a time-limited signed URL (see X-Goog-Date and X-Goog-Expires), so it may no longer work.
# If the request fails, download "Covid Live.csv" and pass its local path to pd.read_csv instead.
df = pd.read_csv('https://storage.googleapis.com/kagglesdsdata/datasets/2513794/4266054/Covid%20Live.csv?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20221106%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20221106T153806Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=1089fbf06815cf61c7c10392cfd20074b8a5dffe2b4afc815b9265181c4cffad1eaba89ed38c1f6322f88292d1867f7ad8336ad3579f1335b0fee4e06edbc40f2b8de2ec73f7c0d022ba2627c6e88f8b459b8063289561fde819c44780931f1dacf6449c6b8f972d4af923d12805cfd03c656b9c412c03d10e5d81f2208ad9c818d85938add254cd3f312360dad47e5efd9ea92ea95c6fc96ef06f01c7f0ffaa15aa43c52fa3aebcba46763b8cf7f9afb7572621855ff4bb91fc7d58f6100fdfb32d1368225da6709323b286097d29b34113ff6fd03c450f66436d9b37ce805193b4db1eae9d1a3d558357292be66eb9b4d13a1660711d8a19429d378d94faa5')

I usually check the shape of the data and print the first few rows, and sometimes the column names as well, so I know what I'm looking at.

In [63]:
print(df.shape)
df.head()

(230, 13)


| | # | Country,\nOther | Total\nCases | Total\nDeaths | New\nDeaths | Total\nRecovered | Active\nCases | Serious,\nCritical | Tot Cases/\n1M pop | Deaths/\n1M pop | Total\nTests | Tests/\n1M pop | Population |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | USA | 98166904 | 1084282 | | 94962112 | 2120510 | 2970 | 293206 | 3239 | 1118158870 | 3339729 | 334805269 |
| 1 | 2 | India | 44587307 | 528629 | | 44019095 | 39583 | 698 | 31698 | 376 | 894416853 | 635857 | 1406631776 |
| 2 | 3 | France | 35342950 | 155078 | | 34527115 | 660757 | 869 | 538892 | 2365 | 271490188 | 4139547 | 65584518 |
| 3 | 4 | Brazil | 34706757 | 686027 | | 33838636 | 182094 | 8318 | 161162 | 3186 | 63776166 | 296146 | 215353593 |
| 4 | 5 | Germany | 33312373 | 149948 | | 32315200 | 847225 | 1406 | 397126 | 1788 | 122332384 | 1458359 | 83883596 |
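
Beyond `shape` and `head()`, a few other calls help when sizing up a new data set. The sketch below uses a small stand-in DataFrame with three rows copied from the output above (the real `df` has 230 rows); the name `sample` is only for illustration, and the `New\nDeaths` values are entered as missing because that column is empty in the rows shown.

```python
import pandas as pd

# Stand-in for df: three rows and three columns taken from the output above.
sample = pd.DataFrame({
    "Country,\nOther": ["USA", "India", "France"],
    "Total\nCases": [98166904, 44587307, 35342950],
    "New\nDeaths": [float("nan")] * 3,
})

# Column names as a plain list, which makes hidden "\n" characters visible.
print(list(sample.columns))

# Data type of each column.
print(sample.dtypes)

# Number of missing values per column.
null_counts = sample.isna().sum()
print(null_counts)
```

Printing the column names as a list is what reveals the embedded newline characters, which is worth knowing before trying to select or rename columns.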


Fix the table by removing the columns we don't need.

In [64]:
df = df.drop(columns=['#', 'New\nDeaths', 'Total\nRecovered', 'Active\nCases', 'Serious,\nCritical', 'Tot Cases/\n1M pop', 'Total\nTests', 'Tests/\n1M pop'])
df.head()

| | Country,\nOther | Total\nCases | Total\nDeaths | Deaths/\n1M pop | Population |
|---|---|---|---|---|---|
| 0 | USA | 98166904 | 1084282 | 3239 | 334805269 |
| 1 | India | 44587307 | 528629 | 376 | 1406631776 |
| 2 | France | 35342950 | 155078 | 2365 | 65584518 |
| 3 | Brazil | 34706757 | 686027 | 3186 | 215353593 |
| 4 | Germany | 33312373 | 149948 | 1788 | 83883596 |
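
When most columns are being removed, it can be shorter to list the columns to keep instead. Here is a minimal sketch on a stand-in DataFrame; the name `sample` is illustrative, and its two rows are copied from the output above.

```python
import pandas as pd

# Stand-in for the original df with a few of its columns.
sample = pd.DataFrame({
    "#": [1, 2],
    "Country,\nOther": ["USA", "India"],
    "Total\nCases": [98166904, 44587307],
    "Total\nDeaths": [1084282, 528629],
    "Active\nCases": [2120510, 39583],
    "Population": [334805269, 1406631776],
})

# Columns worth keeping for this analysis.
keep = ["Country,\nOther", "Total\nCases", "Total\nDeaths", "Population"]

# Selecting with a list returns only those columns, in that order.
trimmed = sample[keep]

# Same result, expressed as a drop of the unwanted columns.
dropped = sample.drop(columns=["#", "Active\nCases"])

print(trimmed.columns.tolist())
```

Either form works; the keep-list tends to read better when only a handful of columns survive.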


Fix the column names, which contain embedded newline characters.


In [68]:
fixed_columns = {
    'Country,\nOther':'Country',
    'Total\nCases':'Cases',
    'Total\nDeaths':'Deaths',
    'Deaths/\n1M pop':'Deaths 1M pop',
}

# Apply the mapping; columns not listed (such as Population) keep their names.
df = df.rename(columns=fixed_columns)
df.head()

| | Country | Cases | Deaths | Deaths 1M pop | Population |
|---|---|---|---|---|---|
| 0 | USA | 98166904 | 1084282 | 3239 | 334805269 |
| 1 | India | 44587307 | 528629 | 376 | 1406631776 |
| 2 | France | 35342950 | 155078 | 2365 | 65584518 |
| 3 | Brazil | 34706757 | 686027 | 3186 | 215353593 |
| 4 | Germany | 33312373 | 149948 | 1788 | 83883596 |
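
Instead of a hand-written mapping, column names with embedded newlines can also be tidied programmatically. This is an alternative sketch, not the approach used above, and it produces slightly different names (for example `Total Cases` rather than `Cases`). The `sample` DataFrame is a stand-in with one row copied from the earlier output.

```python
import pandas as pd

# Stand-in DataFrame with the original, newline-laden column names.
sample = pd.DataFrame({
    "Country,\nOther": ["USA"],
    "Total\nCases": [98166904],
    "Deaths/\n1M pop": [3239],
})

# Replace any run of whitespace (including "\n") with a single space,
# then trim leading and trailing spaces.
sample.columns = sample.columns.str.replace(r"\s+", " ", regex=True).str.strip()

print(sample.columns.tolist())
# ['Country, Other', 'Total Cases', 'Deaths/ 1M pop']
```

A regex pass like this scales to many columns, while an explicit mapping gives full control over the final names; the two can be combined by cleaning whitespace first and renaming afterwards.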
