# Real-world Data Wrangling

In this project, you will apply the skills you acquired in the course to gather and wrangle real-world data with two datasets of your choice.

You will retrieve and extract the data, assess it programmatically and visually across elements of data quality and structure, and implement a cleaning strategy. You will then store the updated data in your selected database/data store, combine the data, and answer a research question with the datasets.

Throughout the process, you are expected to:

1. Explain your decisions about the methods used for gathering, assessing, cleaning, and storing the data, and for answering the research question
2. Write code comments so your code is more readable

Before you start, install some of the required packages.

In [None]:
!python -m pip install kaggle==1.6.12

In [None]:
!pip install --target=/workspace ucimlrepo

In [1]:
import pandas as pd

## 1. Gather data

In this section, you will extract data using two different data gathering methods and combine the data. Use at least two different types of data-gathering methods.

### **1.1.** Problem Statement
In 2-4 sentences, explain the kind of problem you want to look at and the datasets you will be wrangling for this project.

Finding the right datasets can be time-consuming. Here we provide you with a list of websites to start with. But we encourage you to explore more websites and find the data that interests you.

* Google Dataset Search https://datasetsearch.research.google.com/
* The U.S. Government’s open data https://data.gov/
* UCI Machine Learning Repository https://archive.ics.uci.edu/ml/index.php

### **1.2.** Gather at least two datasets using two different data gathering methods

List of data gathering methods:

- Download data manually
- Programmatically downloading files
- Gather data by accessing APIs
- Gather and extract data from HTML files using BeautifulSoup
- Extract data from a SQL database

Each dataset must have at least two variables, and have greater than 500 data samples within each dataset.

For each dataset, briefly describe why you picked the dataset and the gathering method (2-3 full sentences), including the names and significance of the variables in the dataset. Show your work (e.g., if using an API to download the data, please include a snippet of your code). 

Load the dataset programmatically into this notebook.

#### **Dataset 1**: Infrastructure Damaged

Type: JSON File

Method: The data was gathered using the "Programmatically downloading files" method from the GitHub repository https://github.com/TechForPalestine/palestine-datasets/tree/main.

Dataset variables (infrastructure damage report):

Columns prefixed with `ext_` hold extrapolated values that fill gaps in the reported figures. In the sample output below, they match the reported value wherever both are present.

| Column Name | Description |
|---|---|
| report_date | The date of the damage report. |
| civic_buildings_ext_destroyed | The number of civic buildings destroyed (extrapolated). |
| civic_buildings_destroyed | The number of civic buildings destroyed. |
| educational_buildings_ext_destroyed | The number of educational buildings destroyed (extrapolated). |
| educational_buildings_ext_damaged | The number of educational buildings damaged (extrapolated). |
| educational_buildings_destroyed | The number of educational buildings destroyed. |
| educational_buildings_damaged | The number of educational buildings damaged. |
| places_of_worship_ext_mosques_destroyed | The number of mosques destroyed (extrapolated). |
| places_of_worship_ext_mosques_damaged | The number of mosques damaged (extrapolated). |
| places_of_worship_ext_churches_destroyed | The number of churches destroyed (extrapolated). |
| places_of_worship_mosques_destroyed | The number of mosques destroyed. |
| places_of_worship_mosques_damaged | The number of mosques damaged. |
| places_of_worship_churches_destroyed | The number of churches destroyed. |
| residential_ext_destroyed | The number of residential buildings destroyed (extrapolated). |
| residential_destroyed | The number of residential buildings destroyed. |

In [6]:
## Gather the data using the "Programmatically downloading files" method from https://github.com/
nfrastructure_damaged_url ="https://raw.githubusercontent.com/TechForPalestine/palestine-datasets/main/infrastructure-damaged.json"
nfrastructure_damaged_df = pd.read_json(nfrastructure_damaged_url)
nfrastructure_damaged_df.head()

Unnamed: 0,report_date,civic_buildings,educational_buildings,places_of_worship,residential
0,2023-10-07,{'ext_destroyed': 5},"{'ext_destroyed': 1, 'ext_damaged': 15}","{'ext_mosques_destroyed': 2, 'ext_mosques_dama...",{'ext_destroyed': 80}
1,2023-10-08,{'ext_destroyed': 11},"{'ext_destroyed': 1, 'ext_damaged': 30}","{'ext_mosques_destroyed': 4, 'ext_mosques_dama...","{'destroyed': 159, 'ext_destroyed': 159}"
2,2023-10-09,{'ext_destroyed': 16},"{'ext_destroyed': 2, 'ext_damaged': 45}","{'ext_mosques_destroyed': 6, 'ext_mosques_dama...","{'destroyed': 790, 'ext_destroyed': 790}"
3,2023-10-10,{'ext_destroyed': 22},"{'ext_destroyed': 2, 'ext_damaged': 60}","{'ext_mosques_destroyed': 8, 'ext_mosques_dama...","{'destroyed': 1009, 'ext_destroyed': 1009}"
4,2023-10-11,"{'destroyed': 27, 'ext_destroyed': 27}","{'destroyed': 3, 'ext_destroyed': 3, 'damaged'...","{'mosques_destroyed': 10, 'ext_mosques_destroy...","{'destroyed': 2835, 'ext_destroyed': 2835}"


In [7]:
# Start the flattened frame with the report date, which is not nested
infrastructure_damaged_normalized = pd.DataFrame()
infrastructure_damaged_normalized['report_date'] = nfrastructure_damaged_df['report_date']

# Expand each nested dict column into one flat column per sub-key
for col in nfrastructure_damaged_df.columns:
    if col == 'report_date':
        continue  # already copied above; it holds dates, not dicts
    normalized_data = pd.json_normalize(nfrastructure_damaged_df[col])
    normalized_data.columns = [f"{col}_{subcol}" for subcol in normalized_data.columns]  # Rename columns with prefix
    infrastructure_damaged_normalized = pd.concat([infrastructure_damaged_normalized, normalized_data], axis=1)

infrastructure_damaged_normalized.head()

Unnamed: 0,report_date,civic_buildings_ext_destroyed,civic_buildings_destroyed,educational_buildings_ext_destroyed,educational_buildings_ext_damaged,educational_buildings_destroyed,educational_buildings_damaged,places_of_worship_ext_mosques_destroyed,places_of_worship_ext_mosques_damaged,places_of_worship_ext_churches_destroyed,places_of_worship_mosques_destroyed,places_of_worship_mosques_damaged,places_of_worship_churches_destroyed,residential_ext_destroyed,residential_destroyed
0,2023-10-07,5,,1,15,,,2,4,0,,,,80,
1,2023-10-08,11,,1,30,,,4,8,0,,,,159,159.0
2,2023-10-09,16,,2,45,,,6,12,0,,,,790,790.0
3,2023-10-10,22,,2,60,,,8,17,0,,,,1009,1009.0
4,2023-10-11,27,27.0,3,75,3.0,75.0,10,21,0,10.0,,,2835,2835.0
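
The flattening step above can be tried on a small in-memory sample before running it on the full download. The sketch below is only an illustration: the two-row `sample` frame and the `nested_cols` list are hypothetical stand-ins shaped like the records shown above, not the real dataset.

```python
import pandas as pd

# Hypothetical two-row sample shaped like the nested report records above.
sample = pd.DataFrame({
    "report_date": ["2023-10-07", "2023-10-08"],
    "civic_buildings": [{"ext_destroyed": 5}, {"ext_destroyed": 11}],
    "residential": [{"ext_destroyed": 80}, {"destroyed": 159, "ext_destroyed": 159}],
})

# Assumed list of columns whose cells are dicts.
nested_cols = ["civic_buildings", "residential"]

# Flatten each dict column and prefix the sub-keys with the parent column name.
flat_parts = [
    pd.json_normalize(sample[col].tolist()).add_prefix(f"{col}_")
    for col in nested_cols
]

# Put the date first, then the flattened columns side by side.
flat = pd.concat([sample[["report_date"]]] + flat_parts, axis=1)
print(flat)
```

Sub-keys that are absent from a record (here `destroyed` on the first day) come out as `NaN`, which is why several columns in the output above contain missing values.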


> Storing step: save the dataset to the local data store before moving to the next step.

In [8]:
infrastructure_damaged_normalized.to_csv('./data/infrastructure_damaged.csv') ## store to csv file

#### **Dataset 2**: Casualties daily

Type: JSON file

Method: The data was gathered using the "Download data manually" method from the GitHub repository file https://github.com/TechForPalestine/palestine-datasets/blob/main/casualties_daily.json.

Dataset variables:

Columns prefixed with `ext_` hold extrapolated values that fill gaps in the reported figures. In the sample output below, they match the reported value wherever both are present.

| Column Name | Description |
|---|---|
| report_date | Date of the report |
| report_source | Source of the report |
| ext_massacres_cum | Cumulative number of massacres (extrapolated) |
| killed | Number of people killed on the report date |
| killed_cum | Cumulative number of people killed |
| ext_killed | Number of people killed on the report date (extrapolated) |
| ext_killed_cum | Cumulative number of people killed (extrapolated) |
| ext_killed_children_cum | Cumulative number of children killed (extrapolated) |
| ext_killed_women_cum | Cumulative number of women killed (extrapolated) |
| injured_cum | Cumulative number of people injured |
| ext_injured | Number of people injured on the report date (extrapolated) |
| ext_injured_cum | Cumulative number of people injured (extrapolated) |
| ext_civdef_killed_cum | Cumulative number of civil defence personnel killed (extrapolated) |
| med_killed_cum | Cumulative number of medical personnel killed |
| ext_med_killed_cum | Cumulative number of medical personnel killed (extrapolated) |
| press_killed_cum | Cumulative number of press personnel killed |
| ext_press_killed_cum | Cumulative number of press personnel killed (extrapolated) |
| killed_children_cum | Cumulative number of children killed |
| killed_women_cum | Cumulative number of women killed |
| injured | Number of people injured on the report date |
| massacres_cum | Cumulative number of massacres |
| civdef_killed_cum | Cumulative number of civil defence personnel killed |

In [4]:
## 2nd data gathering was manually downloaded from https://github.com/TechForPal
daily_casualties = pd.read_json('./data/casualties_daily.json')
daily_casualties.head()

Unnamed: 0,report_date,report_source,ext_massacres_cum,killed,killed_cum,ext_killed,ext_killed_cum,ext_killed_children_cum,ext_killed_women_cum,injured_cum,...,ext_civdef_killed_cum,med_killed_cum,ext_med_killed_cum,press_killed_cum,ext_press_killed_cum,killed_children_cum,killed_women_cum,injured,massacres_cum,civdef_killed_cum
0,2023-10-07,mohtel,0,232.0,232.0,232,232,0,0,1610.0,...,0,6.0,6,1.0,1,,,,,
1,2023-10-08,mohtel,0,138.0,370.0,138,370,78,41,1788.0,...,0,,6,1.0,1,78.0,41.0,,,
2,2023-10-09,mohtel,8,190.0,560.0,190,560,91,61,2271.0,...,0,6.0,6,3.0,3,91.0,61.0,,,
3,2023-10-10,mohtel,8,340.0,900.0,340,900,260,230,4000.0,...,0,,6,7.0,7,260.0,230.0,,,
4,2023-10-11,gmotel,23,200.0,1100.0,200,1100,398,230,5184.0,...,0,10.0,10,,7,,,1029.0,,


> Storing step: save the dataset to the local data store before moving to the next step.

In [9]:
daily_casualties.to_csv('./data/daily_casualties.csv') ## store to csv file

## 2. Assess data

Assess the data according to data quality and tidiness metrics using the report below.

List **two** data quality issues and **two** tidiness issues. Assess each data issue visually **and** programmatically, then briefly describe the issue you find.  **Make sure you include justifications for the methods you use for the assessment.**


Now that we have gathered the datasets, let's assess the dataset for data quality and structural issues.

Here's a list of the data quality attributes we covered in the course for your reference:

- Completeness
- Validity
- Accuracy
- Consistency
- Uniqueness
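
As a rough starting point, several of these attributes can be checked programmatically with a few pandas calls. The sketch below runs on a hypothetical four-row frame (`df`, with made-up duplicate and missing entries) rather than the real datasets, and the checks shown are one possible choice, not a complete assessment.

```python
import pandas as pd

# Hypothetical sample; the real frames are loaded earlier in the notebook.
df = pd.DataFrame({
    "report_date": ["2023-10-07", "2023-10-08", "2023-10-08", "2023-10-10"],
    "killed": [232.0, 138.0, 138.0, None],
    "killed_cum": [232.0, 370.0, 370.0, 900.0],
})

# Completeness: count missing values per column.
missing_per_column = df.isna().sum()

# Uniqueness: count rows that exactly repeat an earlier row.
duplicate_rows = int(df.duplicated().sum())

# Validity: check whether dates are stored with a datetime dtype.
date_is_datetime = pd.api.types.is_datetime64_any_dtype(df["report_date"])

# Consistency: a cumulative total should never decrease over time.
cumulative_is_monotonic = bool(df["killed_cum"].is_monotonic_increasing)

print(missing_per_column)
print(duplicate_rows, date_is_datetime, cumulative_is_monotonic)
```

Accuracy is harder to test in code, since it usually requires comparing values against an independent source.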

### Quality Issue 1:

In [None]:
#FILL IN - Inspecting the dataframe visually

In [None]:
#FILL IN - Inspecting the dataframe programmatically

Issue and justification: *FILL IN*

### Quality Issue 2:

In [None]:
#FILL IN - Inspecting the dataframe visually

In [None]:
#FILL IN - Inspecting the dataframe programmatically

Issue and justification: *FILL IN*

### Tidiness Issue 1:

In [None]:
#FILL IN - Inspecting the dataframe visually

In [None]:
#FILL IN - Inspecting the dataframe programmatically

Issue and justification: *FILL IN*

### Tidiness Issue 2: 

In [None]:
#FILL IN - Inspecting the dataframe visually

In [None]:
#FILL IN - Inspecting the dataframe programmatically

Issue and justification: *FILL IN*

## 3. Clean data
It's time to address the issues found during assessment to clean and polish your data.

Clean the data to solve the 4 issues corresponding to data quality and tidiness found in the assessing step. **Make sure you include justifications for your cleaning decisions.**

After cleaning each issue, use **either** the visual or the programmatic method to validate that the cleaning was successful.

At this stage, you are also expected to remove variables that are unnecessary for your analysis and combine your datasets. Depending on your datasets, you may choose to perform variable combination and elimination before or after the cleaning stage. Your dataset must have **at least** 4 variables after combining the data.
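
One way to drop unneeded variables and combine two daily datasets is to merge them on a shared date column. The sketch below assumes both frames have a `report_date` column with one row per date; the `damage` and `casualties` frames are small hypothetical samples, and the choice of an inner join is an assumption that suits an analysis needing both sources on every date.

```python
import pandas as pd

# Hypothetical samples standing in for the two cleaned datasets.
damage = pd.DataFrame({
    "report_date": pd.to_datetime(["2023-10-07", "2023-10-08", "2023-10-09"]),
    "residential_ext_destroyed": [80, 159, 790],
    "civic_buildings_ext_destroyed": [5, 11, 16],
})
casualties = pd.DataFrame({
    "report_date": pd.to_datetime(["2023-10-07", "2023-10-08", "2023-10-10"]),
    "killed_cum": [232, 370, 900],
    "injured_cum": [1610, 1788, 4000],
    "report_source": ["mohtel", "mohtel", "mohtel"],
})

# Remove a variable that is not needed for the analysis.
casualties_small = casualties.drop(columns=["report_source"])

# Keep only dates present in both frames; validate guards against
# duplicate dates silently multiplying rows.
combined = damage.merge(
    casualties_small, on="report_date", how="inner", validate="one_to_one"
)
print(combined)
```

With `how="outer"` instead, dates found in only one dataset would be kept and the other side's columns filled with `NaN`.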

In [None]:
# FILL IN - Make copies of the datasets to ensure the raw dataframes 
# are not impacted

### **Quality Issue 1: FILL IN**

In [None]:
# FILL IN - Apply the cleaning strategy

In [None]:
# FILL IN - Validate the cleaning was successful

Justification: *FILL IN*

### **Quality Issue 2: FILL IN**

In [None]:
#FILL IN - Apply the cleaning strategy

In [None]:
#FILL IN - Validate the cleaning was successful

Justification: *FILL IN*

### **Tidiness Issue 1: FILL IN**

In [None]:
#FILL IN - Apply the cleaning strategy

In [None]:
#FILL IN - Validate the cleaning was successful

Justification: *FILL IN*

### **Tidiness Issue 2: FILL IN**

In [None]:
#FILL IN - Apply the cleaning strategy

In [None]:
#FILL IN - Validate the cleaning was successful

Justification: *FILL IN*

### **Remove unnecessary variables and combine datasets**

Depending on the datasets, you can also perform the combination before the cleaning steps.

In [None]:
#FILL IN - Remove unnecessary variables and combine datasets

## 4. Update your data store
Update your local database/data store with the cleaned data, following best practices for storing your cleaned data:

- Must maintain different instances / versions of data (raw and cleaned data)
- Must name the dataset files informatively
- Ensure both the raw and cleaned data is saved to your database/data store
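
The practices listed above can be sketched as follows. The file names and the temporary directory are hypothetical choices for illustration (in the notebook the target would be something like `./data`), and the `raw` and `cleaned` frames are made-up samples.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Hypothetical raw frame and a cleaned version derived from it.
raw = pd.DataFrame({
    "report_date": ["2023-10-07", "2023-10-08"],
    "killed": [232.0, None],
})
cleaned = raw.dropna(subset=["killed"]).reset_index(drop=True)

# Temporary directory used here as a stand-in for the project's data folder.
data_dir = Path(tempfile.mkdtemp())

# Informative names that keep the raw and cleaned versions apart.
raw_path = data_dir / "casualties_daily_raw.csv"
cleaned_path = data_dir / "casualties_daily_cleaned.csv"

# index=False avoids writing the row index as an extra unnamed column.
raw.to_csv(raw_path, index=False)
cleaned.to_csv(cleaned_path, index=False)

# Read the cleaned file back to confirm it round-trips as expected.
reloaded = pd.read_csv(cleaned_path)
print(reloaded)
```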

In [None]:
#FILL IN - saving data

## 5. Answer the research question

### **5.1:** Define and answer the research question 
Going back to the problem statement in step 1, use the cleaned data to answer the question you raised. Produce **at least** two visualizations using the cleaned data and explain how they help you answer the question.

*Research question:* FILL IN from answer to Step 1

In [None]:
#Visual 1 - FILL IN

*Answer to research question:* FILL IN

In [None]:
#Visual 2 - FILL IN

*Answer to research question:* FILL IN

### **5.2:** Reflection
In 2-4 sentences, if you had more time to complete the project, what actions would you take? For example, which data quality and structural issues would you look into further, and what research questions would you further explore?

*Answer:* FILL IN