# Hip Hop Analysis Overview
Using a hip hop dataset to learn the fundamentals of Jupyter and data analysis concepts:
- Data Cleaning
- Data Visualization
- Exploratory Data Analysis

# Environment Setup
## Loading Dependencies

In [1]:
import pandas as pd



## Importing Our Hip Hop Dataset
The dataset was found on datpiff.com, but we've placed it in our directory structure and will import the data from there!

In [2]:
file_path = './hip_hop_mixtapes - Listens.csv' # Defining the file path
hip_hop_data = pd.read_csv(file_path) # Reading in our .csv with pandas
hip_hop = hip_hop_data.copy()

In [3]:
# Quick exploration of data
display(hip_hop_data.shape)
display(hip_hop_data.columns)
display(hip_hop_data.head())
display(hip_hop_data.tail())
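# null_counts was deprecated in pandas 1.2 and later removed; show_counts replaces it.
# info() prints its report and returns None, so display() also shows None below.
display(hip_hop_data.info(show_counts=True))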

(61, 8)

Index(['Album Name', 'Artist', 'listens', 'Number of Songs',
       'Shortest Song Duration', 'Longest Duration', 'Album Duration',
       'Release Date'],
      dtype='object')

Unnamed: 0,Album Name,Artist,listens,Number of Songs,Shortest Song Duration,Longest Duration,Album Duration,Release Date
0,The Kanan Tape,50 Cent,2968652,7,2:43,4:45,0:25:04,12/09/2015
1,LiveLoveA$Ap,ASAP Rocky,1701695,16,2:38,4:52,0:53:41,10/31/2011
2,Detroit,Big Sean,1755528,18,2:19,5:36,1:07:58,09/05/2012
3,Acid Rap,Chance The Rapper,2093978,14,2:19,5:33,0:53:52,04/30/2013
4,Coloring Book,Chance The Rapper,1950537,14,1:41,6:46,0:57:14,05/13/2016


Unnamed: 0,Album Name,Artist,listens,Number of Songs,Shortest Song Duration,Longest Duration,Album Duration,Release Date
56,I'm Up,Young Thug,1296629,9,3:45,5:04,0:38:03,02/05/2016
57,"No, My Name Is Jeffery",Young Thug,1352267,10,2:45,6:01,0:42:15,08/26/2016
58,Slime Season,Young Thug,3214521,18,2:41,5:33,1:10:06,09/16/2015
59,Slime Season 2,,2857852,22,3:02,4:57,1:27:38,10/31/2015
60,Slime Season 3,Young Thug,1841168,8,2:56,4:38,0:28:20,03/25/2016


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61 entries, 0 to 60
Data columns (total 8 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Album Name              61 non-null     object
 1   Artist                  59 non-null     object
 2   listens                 61 non-null     object
 3   Number of Songs         61 non-null     int64 
 4   Shortest Song Duration  57 non-null     object
 5   Longest Duration        57 non-null     object
 6   Album Duration          54 non-null     object
 7   Release Date            61 non-null     object
dtypes: int64(1), object(7)
memory usage: 3.9+ KB


None
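The `info()` report above shows that several columns have fewer than 61 non-null entries. As a minimal sketch of how those gaps could be counted directly, the snippet below uses `isna().sum()` on a small stand-in frame. The rows and values are hypothetical placeholders, not taken from the real file.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the mixtape data (illustrative values only)
sample = pd.DataFrame({
    "Album Name": ["Tape A", "Tape B", "Tape C"],
    "Artist": ["Artist X", None, "Artist Y"],
    "Album Duration": ["0:25:04", None, None],
})

# Count missing values per column, then keep only the columns that have any
missing_per_column = sample.isna().sum()
columns_with_gaps = missing_per_column[missing_per_column > 0]
print(columns_with_gaps)
```

Running the same two lines against `hip_hop_data` would give a compact list of the columns that need attention.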

# Data Cleaning
Based on the quick exploration above, we'll need to clean the following:
- Column Headers
- Remove commas from listens
- Convert types
    - listens
    - Number of Songs
    
## Column Headers
The column headers are inconsistently formatted, so we'll convert them all to snake case.

In [4]:
hip_hop.columns

Index(['Album Name', 'Artist', 'listens', 'Number of Songs',
       'Shortest Song Duration', 'Longest Duration', 'Album Duration',
       'Release Date'],
      dtype='object')

In [5]:
col_headers = list(hip_hop.columns)
formatted_col_headers = []

for col in col_headers:
    col = col.replace(" ", "_")
    formatted_col_headers.append(col.lower())
    
formatted_col_headers

['album_name',
 'artist',
 'listens',
 'number_of_songs',
 'shortest_song_duration',
 'longest_duration',
 'album_duration',
 'release_date']
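The loop above could also be written with the vectorized string methods that pandas exposes on the column index. The sketch below is one possible alternative, not the approach this notebook uses. The frame `df` is a hypothetical empty frame that borrows four of the dataset's header names.

```python
import pandas as pd

# Hypothetical empty frame that reuses some of the dataset's header names
df = pd.DataFrame(columns=["Album Name", "Artist", "listens", "Number of Songs"])

# Index.str applies each string method to every header at once:
# trim stray whitespace, lowercase, then swap spaces for underscores
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
print(list(df.columns))
```

The extra `str.strip()` is an assumption about what might appear in other datasets; the headers shown here have no leading or trailing spaces.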

In [6]:
hip_hop.columns = formatted_col_headers  # Apply the snake_case headers
hip_hop  # Display the frame to confirm the rename

Unnamed: 0,album_name,artist,listens,number_of_songs,shortest_song_duration,longest_duration,album_duration,release_date
0,The Kanan Tape,50 Cent,2968652,7,2:43,4:45,0:25:04,12/09/2015
1,LiveLoveA$Ap,ASAP Rocky,1701695,16,2:38,4:52,0:53:41,10/31/2011
2,Detroit,Big Sean,1755528,18,2:19,5:36,1:07:58,09/05/2012
3,Acid Rap,Chance The Rapper,2093978,14,2:19,5:33,0:53:52,04/30/2013
4,Coloring Book,Chance The Rapper,1950537,14,1:41,6:46,0:57:14,05/13/2016
...,...,...,...,...,...,...,...,...
56,I'm Up,Young Thug,1296629,9,3:45,5:04,0:38:03,02/05/2016
57,"No, My Name Is Jeffery",Young Thug,1352267,10,2:45,6:01,0:42:15,08/26/2016
58,Slime Season,Young Thug,3214521,18,2:41,5:33,1:10:06,09/16/2015
59,Slime Season 2,,2857852,22,3:02,4:57,1:27:38,10/31/2015


## Removing Delimiters
The listens column uses commas as thousands separators, so we'll remove them before converting the values to numbers.

In [7]:
hip_hop.listens = hip_hop.listens.apply(lambda row: row.replace(',', ''))
hip_hop

Unnamed: 0,album_name,artist,listens,number_of_songs,shortest_song_duration,longest_duration,album_duration,release_date
0,The Kanan Tape,50 Cent,2968652,7,2:43,4:45,0:25:04,12/09/2015
1,LiveLoveA$Ap,ASAP Rocky,1701695,16,2:38,4:52,0:53:41,10/31/2011
2,Detroit,Big Sean,1755528,18,2:19,5:36,1:07:58,09/05/2012
3,Acid Rap,Chance The Rapper,2093978,14,2:19,5:33,0:53:52,04/30/2013
4,Coloring Book,Chance The Rapper,1950537,14,1:41,6:46,0:57:14,05/13/2016
...,...,...,...,...,...,...,...,...
56,I'm Up,Young Thug,1296629,9,3:45,5:04,0:38:03,02/05/2016
57,"No, My Name Is Jeffery",Young Thug,1352267,10,2:45,6:01,0:42:15,08/26/2016
58,Slime Season,Young Thug,3214521,18,2:41,5:33,1:10:06,09/16/2015
59,Slime Season 2,,2857852,22,3:02,4:57,1:27:38,10/31/2015
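As a hedged sketch of two other ways to handle the separators, the snippet below strips commas with the vectorized `Series.str.replace`, and separately lets `pd.read_csv` parse them at load time through its `thousands` parameter. The CSV text is a hypothetical two-row sample, not the real file.

```python
import io
import pandas as pd

# Hypothetical two-row CSV with comma-formatted counts (not the real file)
csv_text = 'album_name,listens\nTape A,"2,968,652"\nTape B,"1,701,695"\n'

raw = pd.read_csv(io.StringIO(csv_text))
# Vectorized equivalent of the apply/lambda: strip commas from each string
cleaned = raw["listens"].str.replace(",", "", regex=False)

# Alternative: let read_csv interpret the thousands separator while loading,
# which yields an integer column directly
parsed = pd.read_csv(io.StringIO(csv_text), thousands=",")
print(cleaned.tolist())
print(parsed["listens"].tolist())
```

The first form leaves the values as strings, so a type conversion is still needed afterwards; the second skips that step but applies to every numeric-looking column in the file.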


## Convert Types
The listens column is still stored as strings (dtype object). Since we'll eventually want to do calculations with it, we need to convert it to an integer type.

In [8]:
hip_hop.info(show_counts=True, memory_usage='deep')  # show_counts replaces the deprecated null_counts

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61 entries, 0 to 60
Data columns (total 8 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   album_name              61 non-null     object
 1   artist                  59 non-null     object
 2   listens                 61 non-null     object
 3   number_of_songs         61 non-null     int64 
 4   shortest_song_duration  57 non-null     object
 5   longest_duration        57 non-null     object
 6   album_duration          54 non-null     object
 7   release_date            61 non-null     object
dtypes: int64(1), object(7)
memory usage: 27.3 KB


In [9]:
hip_hop.listens = hip_hop.listens.astype('int32')
hip_hop.number_of_songs = hip_hop.number_of_songs.astype('int32')

In [12]:
hip_hop.info(show_counts=True, memory_usage='deep')  # show_counts replaces the deprecated null_counts

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61 entries, 0 to 60
Data columns (total 8 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   album_name              61 non-null     object
 1   artist                  59 non-null     object
 2   listens                 61 non-null     int32 
 3   number_of_songs         61 non-null     int32 
 4   shortest_song_duration  57 non-null     object
 5   longest_duration        57 non-null     object
 6   album_duration          54 non-null     object
 7   release_date            61 non-null     object
dtypes: int32(2), object(6)
memory usage: 23.5 KB
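Narrowing to `int32` is safe only when every value fits in 32 bits. The sketch below shows one way to guard the conversion, assuming the strings have already had their commas removed. The series `listens` holds a few values copied from the table above, and the range check is an added precaution rather than part of the original notebook.

```python
import numpy as np
import pandas as pd

# A few listen counts from the table above, as comma-free strings
listens = pd.Series(["2968652", "1701695", "3214521"])

# to_numeric raises on anything that is not a number, which surfaces bad rows early
listens_int = pd.to_numeric(listens)

# Confirm the largest value fits before narrowing to int32 (max 2,147,483,647)
assert listens_int.max() <= np.iinfo(np.int32).max
listens_int = listens_int.astype("int32")
print(listens_int.dtype)
```

If some rows might contain non-numeric text, `pd.to_numeric(listens, errors="coerce")` would turn them into `NaN` instead of raising, at the cost of producing a float column.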


In [13]:
hip_hop

Unnamed: 0,album_name,artist,listens,number_of_songs,shortest_song_duration,longest_duration,album_duration,release_date
0,The Kanan Tape,50 Cent,2968652,7,2:43,4:45,0:25:04,12/09/2015
1,LiveLoveA$Ap,ASAP Rocky,1701695,16,2:38,4:52,0:53:41,10/31/2011
2,Detroit,Big Sean,1755528,18,2:19,5:36,1:07:58,09/05/2012
3,Acid Rap,Chance The Rapper,2093978,14,2:19,5:33,0:53:52,04/30/2013
4,Coloring Book,Chance The Rapper,1950537,14,1:41,6:46,0:57:14,05/13/2016
...,...,...,...,...,...,...,...,...
56,I'm Up,Young Thug,1296629,9,3:45,5:04,0:38:03,02/05/2016
57,"No, My Name Is Jeffery",Young Thug,1352267,10,2:45,6:01,0:42:15,08/26/2016
58,Slime Season,Young Thug,3214521,18,2:41,5:33,1:10:06,09/16/2015
59,Slime Season 2,,2857852,22,3:02,4:57,1:27:38,10/31/2015
