**Table of contents**<a id='toc0_'></a>    
- [**High Level Data Cleaning Process**](#toc1_)    
  - [***Gather***](#toc1_1_)    
  - [***Assess***](#toc1_2_)    
  - [***Clean***](#toc1_3_)    
  - [**Clean Data has Two Dimensions:**](#toc1_4_)    
      - [***Cleanliness***](#toc1_4_1_1_)    
      - [***Tidiness:***](#toc1_4_1_2_)    
  - [**Order for Addressing Problems:**](#toc1_5_)    
- [Demo](#toc2_)    
  - [Gathering](#toc2_1_)    
    - [Setup](#toc2_1_1_)    
    - [Load](#toc2_1_2_)    
  - [Assess](#toc2_2_)    
    - [Programmatic Assessment](#toc2_2_1_)    
    - [Visual Assessment](#toc2_2_2_)    
    - [Issues to Resolve](#toc2_2_3_)    
      - [Quality issues](#toc2_2_3_1_)    
      - [Tidiness issues](#toc2_2_3_2_)    
  - [Clean](#toc2_3_)    
    - [Issue #1:](#toc2_3_1_)    
      - [Define:](#toc2_3_1_1_)    
      - [Code](#toc2_3_1_2_)    
      - [Test](#toc2_3_1_3_)    
    - [Issue #2:....](#toc2_3_2_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[**High Level Data Cleaning Process**](#toc0_)



## <a id='toc1_1_'></a>[***Gather***](#toc0_)
1. **Set Up Libraries**: Import the libraries you will need for your analysis, typically pandas, numpy, matplotlib, and seaborn. Importing them at the top of the script keeps every dependency for analysis, visualization, and modeling in one place.


In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns



2. **Load in Data**: Read the data from its source (a CSV or Excel file, a SQL database, etc.) into a DataFrame, the tabular data structure provided by the pandas library. This is the starting point for the analysis.


In [None]:

df = pd.read_csv('data.csv')


## <a id='toc1_2_'></a>[***Assess***](#toc0_)
1. **Programmatic Assessment**: Review the data using code, with methods such as `df.info()` and `df.describe()`. These show the structure of the data, the type of each variable, and basic summary statistics.

2. **Visual Assessment**: Review the data by scrolling through it in a spreadsheet or using `df.head()`, `df.tail()`. This can help you spot anomalies or patterns in the data that may not be immediately apparent through programmatic assessment.
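
As a rough illustration of both kinds of assessment, here is a minimal sketch on a small made-up DataFrame. The column names and values are assumptions chosen for the example and are not taken from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "tweet_id": [1, 2, 2, 4],
    "rating": [12.0, 10.0, 10.0, np.nan],
    "name": ["Rex", "Bo", "Bo", None],
})

# Programmatic assessment
df.info()                               # dtypes and non-null counts per column
summary = df.describe()                 # basic statistics for numeric columns
missing_per_column = df.isna().sum()    # number of missing values per column
duplicate_rows = df.duplicated().sum()  # number of fully duplicated rows

# Visual assessment
first_rows = df.head()                      # first rows of the table
random_rows = df.sample(3, random_state=0)  # a random sample can reveal issues head() misses

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
```

Sampling random rows is optional, but it avoids judging the whole table by its first and last few records.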



## <a id='toc1_3_'></a>[***Clean***](#toc0_)
1. **Define**: Describe in words how you will fix each issue. This is your plan of action for the data quality and tidiness issues you identified. Writing the plan before you start coding ensures you have a clear understanding of the steps you need to take.

2. **Code**: Convert your definitions into executable code. This is where you implement your plan. This could involve writing functions to clean the data, using built-in pandas functions, or using other data cleaning libraries.

3. **Test**: Test your data to ensure your code was implemented correctly. This involves checking your cleaned data to confirm that it's in the expected format and that the data quality and tidiness issues have been addressed. This can be done using a combination of programmatic and visual assessments.
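
One possible shape for a single Define, Code, Test cycle is sketched below. The `timestamp` column and its values are hypothetical, and the issue being fixed (a date stored as text) is only an example.

```python
import pandas as pd

# Hypothetical example data: 'timestamp' was read in as text.
df = pd.DataFrame({"timestamp": ["2017-08-01 16:23:56", "2017-07-31 00:18:03"]})

# Define: convert the 'timestamp' column from string to datetime.

# Code: work on a copy so the original stays available for comparison.
clean_df = df.copy()
clean_df["timestamp"] = pd.to_datetime(clean_df["timestamp"])

# Test: confirm the dtype changed and no values were lost in the conversion.
assert pd.api.types.is_datetime64_any_dtype(clean_df["timestamp"])
assert clean_df["timestamp"].notna().all()
```

Using `assert` in the Test step makes a failed fix stop the notebook immediately instead of relying on a visual check alone.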

## <a id='toc1_4_'></a>[**Clean Data has Two Dimensions:**](#toc0_)

#### <a id='toc1_4_1_1_'></a>[Cleanliness](#toc0_)
1. **Completeness**: Do we have all of the records that we should?
2. **Validity**: We have the records, but do they conform to a defined schema, i.e., a defined set of rules for the data? Records that break those rules are invalid.
3. **Accuracy**: Inaccurate data is wrong data that is valid. It adheres to the defined schema, but it is still incorrect.
4. **Consistency**: Inconsistent data is both valid and accurate, but there are multiple correct ways of referring to the same thing.
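
To make the four dimensions more concrete, the following sketch checks each one on a made-up table. The columns, the rule that heights must be positive, and the 50 cm plausibility threshold are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical patient records, made up for illustration.
df = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "height_cm": [172.0, np.nan, -5.0, 17.2],
    "state": ["NY", "New York", "CA", "CA"],
})

# Completeness: count the missing heights.
n_missing = df["height_cm"].isna().sum()

# Validity: heights must be positive (an assumed schema rule).
invalid = df[df["height_cm"] <= 0]

# Accuracy: 17.2 cm passes the rule above but is implausible,
# so flag positive values below an assumed threshold for manual review.
suspicious = df[df["height_cm"].between(0, 50, inclusive="neither")]

# Consistency: "NY" and "New York" are two labels for the same state.
state_labels = sorted(df["state"].unique())

print(n_missing, invalid["patient_id"].tolist(),
      suspicious["patient_id"].tolist(), state_labels)
```

Accuracy problems usually cannot be detected by a rule alone; the threshold here only surfaces candidates that still need to be checked against a trusted source.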

#### <a id='toc1_4_1_2_'></a>[Tidiness:](#toc0_)
- Tidy data has 3 qualities:
    - Each variable forms a column.
    - Each observation forms a row.
    - Each type of observational unit forms a table.
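
A small sketch of tidying a table in which one variable is spread across column headers is shown below. The table and its column names are hypothetical; `melt` is one common way to reshape such data in pandas, not the only one.

```python
import pandas as pd

# Hypothetical untidy table: the variable "year" is stored in the column headers.
untidy = pd.DataFrame({
    "country": ["A", "B"],
    "2019": [10, 20],
    "2020": [11, 21],
})

# Reshape so each variable (country, year, cases) has its own column
# and each country-year observation has its own row.
tidy = untidy.melt(id_vars="country", var_name="year", value_name="cases")
tidy["year"] = tidy["year"].astype(int)

print(tidy)
```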

## <a id='toc1_5_'></a>[Order for Addressing Problems:](#toc0_)

 1. **Completeness Issues (Missing Data)**: Address these first so that later cleaning steps do not have to be repeated on data added afterwards.
 2. **Tidy the Tables**: Tidy datasets with data quality issues are almost always easier to clean than untidy datasets with the same issues.
 3. **Quality Control**: Address the remaining validity, accuracy, and consistency issues.

# <a id='toc2_'></a>[**Demo**](#toc0_)

## <a id='toc2_1_'></a>[**Gathering**](#toc0_)

### <a id='toc2_1_1_'></a>[Setup](#toc0_)

In [None]:
import pandas as pd
import numpy as np

### <a id='toc2_1_2_'></a>[Load](#toc0_)


In [None]:
# read in csv file on disk
twitter_df = pd.read_csv('Data/twitter-archive-enhanced (1).csv')



## <a id='toc2_2_'></a>[**Assess**](#toc0_)

### <a id='toc2_2_1_'></a>[Programmatic Assessment](#toc0_)

In [None]:
twitter_df.describe()

### <a id='toc2_2_2_'></a>[Visual Assessment](#toc0_)

In [None]:
twitter_df.head()

### <a id='toc2_2_3_'></a>[List of Issues to Resolve](#toc0_)

#### <a id='toc2_2_3_1_'></a>[Quality issues](#toc0_)

1. api_df is missing 30 tweet entries that are present in twitter_df. - Completeness

2. image_df has 281 fewer entries than twitter_df and potentially has multiple images for some entries. - Completeness
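
Counts like these can be obtained by comparing the `tweet_id` values across tables. The sketch below uses tiny made-up stand-ins for the three DataFrames, since only `twitter_df` is loaded in this demo; the numbers it produces are not the 30 and 281 quoted above.

```python
import pandas as pd

# Hypothetical stand-ins for the three tables; only tweet_id matters here.
twitter_df = pd.DataFrame({"tweet_id": [1, 2, 3, 4, 5]})
api_df = pd.DataFrame({"tweet_id": [1, 2, 3, 4]})
image_df = pd.DataFrame({"tweet_id": [1, 2, 2, 5]})

# Tweets in twitter_df with no matching row in api_df.
missing_from_api = twitter_df[~twitter_df["tweet_id"].isin(api_df["tweet_id"])]

# Tweets in twitter_df with no matching row in image_df.
missing_from_image = twitter_df[~twitter_df["tweet_id"].isin(image_df["tweet_id"])]

# tweet_ids that appear more than once in image_df (multiple images per tweet).
multi_image_ids = image_df.loc[image_df["tweet_id"].duplicated(), "tweet_id"].unique()

print(len(missing_from_api), len(missing_from_image), list(multi_image_ids))
```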



#### <a id='toc2_2_3_2_'></a>[Tidiness issues](#toc0_)
   
1.   Numerator and Denominator are stored in two separate columns although together they form one piece of information. - Columns  
     -    Combine these two columns into a single new column. 
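
One way to combine the two columns is to store their ratio, as sketched below. The column names `rating_numerator` and `rating_denominator` are assumed, and dividing is only one option; keeping a string such as "13/10" would be another.

```python
import pandas as pd

# Hypothetical data; the column names are assumptions for this example.
df = pd.DataFrame({
    "tweet_id": [1, 2],
    "rating_numerator": [13, 9],
    "rating_denominator": [10, 10],
})

# Combine the two parts into one column, then drop the originals.
df["rating"] = df["rating_numerator"] / df["rating_denominator"]
df = df.drop(columns=["rating_numerator", "rating_denominator"])

print(df)
```

A denominator of zero would produce `inf` here, so such rows would need to be handled before dividing.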


## <a id='toc2_3_'></a>[**Clean**](#toc0_)

### <a id='toc2_3_1_'></a>[Issue #1:](#toc0_)

#### <a id='toc2_3_1_1_'></a>[Define:](#toc0_)



The first issues I wanted to address were the Completeness issues. However, all of them are much easier to handle once the three DataFrames are merged into one, and the merge resolves three of my issues in a single step. Here is my breakdown of each issue.

**Quality**:
1. api_df is missing 30 tweet entries that are present in twitter_df. - Completeness  
   - Check whether the missing tweets have something in common. If not, consider dropping the incomplete rows from twitter_df. 

2. image_df has 281 fewer entries than twitter_df and potentially has multiple images for some entries. - Completeness
   - Check whether the missing tweets have something in common. If not, consider dropping the incomplete rows from twitter_df. 

**Tidiness**:   

2. Combine the three DataFrames into one master DataFrame on `tweet_id`. - Tables   
   - Use `pd.merge` to merge all three DataFrames on `tweet_id`.



#### <a id='toc2_3_1_2_'></a>[Code](#toc0_)

In [None]:
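# Assumes clean_twitter_df, clean_api_df and clean_image_df are cleaned copies
# of the three source DataFrames, each with a tweet_id column.
# An outer join keeps every tweet_id that appears in any of the tables.
master_df = pd.merge(clean_twitter_df, clean_api_df,
                     how='outer', on='tweet_id', sort=True)

master_df = pd.merge(master_df, clean_image_df,
                     how='outer', on='tweet_id', sort=True)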

print(master_df.columns)

master_df.info()
master_df.head()


#### <a id='toc2_3_1_3_'></a>[Test](#toc0_)

In [None]:
master_df.info()
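
Beyond reading the `info()` output, the merge can also be checked with explicit assertions. The sketch below rebuilds the merge on tiny made-up tables; the columns `text`, `favorite_count`, and `jpg_url` are hypothetical placeholders for whatever the real tables contain.

```python
import pandas as pd

# Hypothetical stand-ins for the cleaned tables.
clean_twitter_df = pd.DataFrame({"tweet_id": [1, 2, 3], "text": ["a", "b", "c"]})
clean_api_df = pd.DataFrame({"tweet_id": [1, 2], "favorite_count": [5, 7]})
clean_image_df = pd.DataFrame({"tweet_id": [1, 3], "jpg_url": ["u1", "u3"]})

master_df = pd.merge(clean_twitter_df, clean_api_df, how="outer", on="tweet_id")
master_df = pd.merge(master_df, clean_image_df, how="outer", on="tweet_id")

# Every tweet_id from any source table should appear exactly once.
expected_ids = (set(clean_twitter_df["tweet_id"])
                | set(clean_api_df["tweet_id"])
                | set(clean_image_df["tweet_id"]))
assert set(master_df["tweet_id"]) == expected_ids
assert master_df["tweet_id"].is_unique

# An outer join leaves NaN where a source table had no matching row,
# so these counts show how many tweets lack API or image data.
missing_counts = master_df[["favorite_count", "jpg_url"]].isna().sum()
print(missing_counts)
```

If `tweet_id` is not unique in one of the source tables, the uniqueness assertion fails, which is a useful signal that duplicates should be resolved before merging.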


### <a id='toc2_3_2_'></a>[Issue #2:....](#toc0_)

#### <a id='toc2_3_1_1_'></a>[Define:](#toc0_)


#### <a id='toc2_3_1_2_'></a>[Code](#toc0_)

#### <a id='toc2_3_1_3_'></a>[Test](#toc0_)