# <a id='toc1_'></a>[**Cleaning Demo and Template**](#toc0_)

Name  
Topic  
email  
June 4th, 2023  


**Table of contents**<a id='toc0_'></a>    
- [**Cleaning Demo and Template**](#toc1_)    
    - [**Load**](#toc1_1_1_)    
  - [**Assess**](#toc1_2_)    
    - [Programmatic Assessment](#toc1_2_1_)    
    - [Visual Assessment](#toc1_2_2_)    
    - [List of Issues to Resolve](#toc1_2_3_)    
      - [Quality issues](#toc1_2_3_1_)    
      - [Tidiness issues](#toc1_2_3_2_)    
  - [**Clean**](#toc1_3_)    
    - [Issue #1:](#toc1_3_1_)    
      - [Define:](#toc1_3_1_1_)    
      - [Code](#toc1_3_1_2_)    
      - [Test](#toc1_3_1_3_)    
    - [Issue #2:....](#toc1_3_2_)    
      - [Define:](#toc1_3_2_1_)    
      - [Code](#toc1_3_2_2_)    
      - [Test](#toc1_3_2_3_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

In [None]:
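# Assumption: my_code re-exports os, pd (pandas), plt (matplotlib.pyplot) and sns (seaborn), which the cells below rely on
from my_code import *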

In [None]:
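# initialize styling params
# apply the style sheet first: plt.style.use() resets every rcParam the sheet defines,
# so it would undo any overrides made before it
plt.style.use('fivethirtyeight')
plt.rcParams["xtick.direction"] = "in"
plt.rcParams["ytick.direction"] = "in"
plt.rcParams["font.size"] = 11.0
plt.rcParams["figure.figsize"] = (9, 6)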

sns.set_style("whitegrid")
sns.set_palette("viridis")
sns.set_context("notebook")

pd.set_option("display.max_columns", 50)
pd.set_option('display.max_colwidth', 1000)
pd.plotting.register_matplotlib_converters()
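# PYTHONHASHSEED is read at interpreter startup, so this affects child processes, not the running kernel
os.environ["PYTHONHASHSEED"] = "123"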

# import warnings
# warnings.filterwarnings('ignore')

## <a id='toc1_1_1_'></a>[**Load**](#toc0_)


In [None]:
# read in csv file on disk
twitter_df = pd.read_csv('Data/twitter-archive-enhanced (1).csv')



## <a id='toc1_2_'></a>[**Assess**](#toc0_)

### <a id='toc1_2_1_'></a>[Programmatic Assessment](#toc0_)

In [None]:
twitter_df.describe()
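
Beyond `describe()`, a programmatic assessment usually also counts missing values and duplicated keys. The sketch below shows those checks on a small stand-in dataframe; `toy_df` and its column names are hypothetical and are not taken from the real archive file.

```python
import pandas as pd

# Hypothetical stand-in for twitter_df; the real column names may differ.
toy_df = pd.DataFrame(
    {
        "tweet_id": [1, 2, 2, 4],
        "text": ["good dog", "great dog", "great dog", None],
    }
)

# Count missing values per column (completeness).
missing_per_column = toy_df.isna().sum()

# Count rows whose tweet_id repeats an earlier row (validity of the key).
duplicate_ids = toy_df["tweet_id"].duplicated().sum()

print(toy_df.dtypes)
print(missing_per_column)
print(duplicate_ids)
```

The same three calls can be run on `twitter_df` directly once it is loaded.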

### <a id='toc1_2_2_'></a>[Visual Assessment](#toc0_)

In [None]:
twitter_df.head()

### <a id='toc1_2_3_'></a>[List of Issues to Resolve](#toc0_)

List of Issues to Fix

#### <a id='toc1_2_3_1_'></a>[Quality issues](#toc0_)

1. api_df is missing 30 tweet entries that are present in twitter_df. - Completeness

2. image_df has 281 fewer entries than twitter_df and potentially has multiple images for some entries. - Completeness



#### <a id='toc1_2_3_2_'></a>[Tidiness issues](#toc0_)
   
1.   The rating numerator and denominator are stored in two separate columns although together they are one piece of information. - Columns  
     -    Combine these two columns into a single new column. 
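
One way to combine the two columns is sketched below. The column names `rating_numerator` and `rating_denominator`, the new `rating` column, and the sample values are assumptions made for illustration. Storing the result as a string such as "13/10" is only one option; a numeric ratio would be another.

```python
import pandas as pd

# Hypothetical sample; column names are assumed, not confirmed by the data.
ratings_df = pd.DataFrame(
    {
        "tweet_id": [1, 2],
        "rating_numerator": [13, 12],
        "rating_denominator": [10, 10],
    }
)

# Build a single "numerator/denominator" column from the two parts.
ratings_df["rating"] = (
    ratings_df["rating_numerator"].astype(str)
    + "/"
    + ratings_df["rating_denominator"].astype(str)
)

# Drop the original columns so the information is stored only once.
ratings_df = ratings_df.drop(columns=["rating_numerator", "rating_denominator"])

print(ratings_df)
```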


## <a id='toc1_3_'></a>[**Clean**](#toc0_)

### <a id='toc1_3_1_'></a>[Issue #1:](#toc0_)

#### <a id='toc1_3_1_1_'></a>[Define:](#toc0_)



The first issues I wanted to address were the Completeness issues. All of them are much easier to resolve once the three dataframes are merged, so a single merge addresses three of my issues in one step. Here is my breakdown of each issue.

**Quality**:
1. api_df is missing 30 tweet entries that are present in twitter_df. - Completeness  
   - Check to see if there is some commonality with the missing tweets. Otherwise, consider deleting incomplete data from twitter_df. 

2. image_df has 281 fewer entries than twitter_df and potentially has multiple images for some entries. - Completeness
   - Check to see if there is some commonality with the missing tweets. Otherwise, consider deleting incomplete data from twitter_df. 

**Tidiness**:   

2. Combine the three dataframes into one master dataframe on tweet_id. - Tables   
   - Use pd.merge to merge all three dfs on tweet_id
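
To look for commonality among the missing tweets, one possible approach is a left merge with `indicator=True`, which labels the rows that have no match in the other dataframe. This is a minimal sketch on made-up data; `twitter_toy`, `api_toy`, and the `source` and `favorite_count` columns are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical stand-ins for twitter_df and api_df.
twitter_toy = pd.DataFrame(
    {
        "tweet_id": [1, 2, 3, 4],
        "source": ["iphone", "web", "iphone", "web"],
    }
)
api_toy = pd.DataFrame({"tweet_id": [1, 3], "favorite_count": [10, 30]})

# indicator=True adds a "_merge" column saying where each row was found.
flagged = twitter_toy.merge(
    api_toy[["tweet_id"]], on="tweet_id", how="left", indicator=True
)

# Keep only the tweets that are absent from the API data.
missing_from_api = flagged.loc[flagged["_merge"] == "left_only"].drop(
    columns="_merge"
)

print(missing_from_api)
# Inspect whether the missing rows share a value in some column.
print(missing_from_api["source"].value_counts())
```

If the missing rows cluster on one value, that hints at a shared cause. If they do not, dropping the incomplete rows may be the simpler choice.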



#### <a id='toc1_3_1_2_'></a>[Code](#toc0_)

In [None]:
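# clean_twitter_df, clean_api_df and clean_image_df are assumed to be defined in earlier cells
# outer joins keep every tweet_id, so rows missing from any source show up as NaN
master_df = pd.merge(clean_twitter_df, clean_api_df,
                     how='outer', on='tweet_id', sort=True)

master_df = pd.merge(master_df, clean_image_df,
                     how='outer', on='tweet_id', sort=True)

print(master_df.columns)

master_df.info()
master_df.head()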


#### <a id='toc1_3_1_3_'></a>[Test](#toc0_)

In [None]:
master_df.info()
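
Reading the `info()` output by eye is easy to get wrong, so the test could also assert properties of the merged result. The sketch below uses two small hypothetical dataframes (`left` and `right`) in place of the cleaned dataframes. It assumes `tweet_id` is unique within each input, which the real data may not satisfy.

```python
import pandas as pd

# Hypothetical inputs standing in for the cleaned dataframes.
left = pd.DataFrame({"tweet_id": [1, 2, 3], "text": ["a", "b", "c"]})
right = pd.DataFrame({"tweet_id": [2, 3, 4], "retweet_count": [5, 6, 7]})

merged = pd.merge(left, right, how="outer", on="tweet_id", sort=True)

# An outer merge on a unique key should contain each key exactly once.
expected_ids = set(left["tweet_id"]) | set(right["tweet_id"])
assert merged["tweet_id"].is_unique
assert set(merged["tweet_id"]) == expected_ids
assert len(merged) == len(expected_ids)

# Rows present on only one side appear as NaN in the other side's columns.
print(merged.isna().sum())
```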


### <a id='toc1_3_2_'></a>[Issue #2:....](#toc0_)

#### <a id='toc1_3_2_1_'></a>[Define:](#toc0_)


#### <a id='toc1_3_2_2_'></a>[Code](#toc0_)

#### <a id='toc1_3_2_3_'></a>[Test](#toc0_)