<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Data Manipulation, EDA, and Reporting Results

_Authors: Joseph Nelson (DC), Sam Stack (DC)_

---

> **This lab is intentionally open-ended, and you're encouraged to answer your own questions about the dataset!**


### What makes a song a hit?

On next week's episode of the 'Are You Entertained?' podcast, we're going to be analyzing the latest generation's guilty pleasure: the music of the '00s.

Our data scientists have pored over Billboard chart data to analyze what made a hit soar to the top of the charts, and how long it stayed there. Tune in next week for an awesome exploration of music and data as we continue to address an omnipresent question in the industry: why do we like what we like?

**Provide (at least) a markdown cell explaining your key learnings about top hits: what are they, what common themes are there, is there a trend among artists (type of music)?**

---

### Minimum Requirements

**At a minimum, you must:**

- Use Pandas to read in your data
- Rename column names where appropriate
- Describe your data: check the value counts and descriptive statistics
- Make use of groupby statements
- Use Boolean filtering (masking) to select and sort subsets of the data
- Assess the validity of your data (missing data, distributions?)
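
As a rough illustration of the checklist above, here is a minimal sketch on a small made-up DataFrame. The column names (`artist`, `genre`, `first_week`, `second_week`) and all values are assumptions for demonstration only; they are not the Billboard data.

```python
import pandas as pd

# Toy stand-in for the chart data; names and values are illustrative only.
toy = pd.DataFrame({
    'artist': ['A', 'A', 'B', 'C'],
    'genre': ['Rock', 'Rock', 'Pop', 'Rock'],
    'first_week': [78, 15, 71, 41],
    'second_week': [63.0, 8.0, None, 23.0],
})

genre_counts = toy['genre'].value_counts()        # value counts per genre
summary = toy.describe()                          # descriptive statistics
mean_entry_by_genre = toy.groupby('genre')['first_week'].mean()  # groupby
top_50_entries = toy[toy['first_week'] <= 50]     # Boolean mask filter
missing_per_column = toy.isnull().sum()           # missing values per column
```

The same calls should carry over to the real dataset once the column names are swapped in.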

**You should strive to:**

- Produce a blog-post ready description of your lab
- State your assumptions about the data
- Describe limitations
- Consider how a stakeholder (radio station, record label, fan) could act on your findings
- Include visualizations

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as stats

plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# Billboard data CSV:
billboard_csv = './datasets/billboard.csv'

# We need to use encoding='latin-1' to deal with non-ASCII characters.
df = pd.read_csv(billboard_csv, encoding='latin-1')

In [2]:
df.head()

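| | year | artist.inverted | track | time | genre | date.entered | date.peaked | x1st.week | x2nd.week | x3rd.week | ... | x67th.week | x68th.week | x69th.week | x70th.week | x71st.week | x72nd.week | x73rd.week | x74th.week | x75th.week | x76th.week |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2000 | Destiny's Child | Independent Women Part I | 3:38 | Rock | 2000-09-23 | 2000-11-18 | 78 | 63.0 | 49.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 2000 | Santana | Maria, Maria | 4:18 | Rock | 2000-02-12 | 2000-04-08 | 15 | 8.0 | 6.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 2000 | Savage Garden | I Knew I Loved You | 4:07 | Rock | 1999-10-23 | 2000-01-29 | 71 | 48.0 | 43.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 2000 | Madonna | Music | 3:45 | Rock | 2000-08-12 | 2000-09-16 | 41 | 23.0 | 18.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 2000 | Aguilera, Christina | Come On Over Baby (All I Want Is You) | 3:38 | Rock | 2000-08-05 | 2000-10-14 | 57 | 47.0 | 45.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |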


In [6]:
# regex=False makes '.' a literal dot rather than a regex wildcard.
df.columns = df.columns.str.replace('.', '_', regex=False)
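
The rename step can be sketched on its own. In a regular expression `.` matches any character, and the default value of the `regex` argument to `str.replace` has changed across pandas versions, so stating the intent explicitly is the safer habit. The three column names below are taken from the dataset; the variable names are illustrative.

```python
import pandas as pd

cols = pd.Index(['artist.inverted', 'date.entered', 'x1st.week'])

# Literal replacement: '.' is treated as a plain dot.
renamed = cols.str.replace('.', '_', regex=False)

# Equivalent regex form: the dot must be escaped to match literally.
renamed_regex = cols.str.replace(r'\.', '_', regex=True)
```

Both forms should give the same underscore-separated names.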

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 317 entries, 0 to 316
Data columns (total 83 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   year             317 non-null    int64  
 1   artist_inverted  317 non-null    object 
 2   track            317 non-null    object 
 3   time             317 non-null    object 
 4   genre            317 non-null    object 
 5   date_entered     317 non-null    object 
 6   date_peaked      317 non-null    object 
 7   x1st_week        317 non-null    int64  
 8   x2nd_week        312 non-null    float64
 9   x3rd_week        307 non-null    float64
 10  x4th_week        300 non-null    float64
 11  x5th_week        292 non-null    float64
 12  x6th_week        280 non-null    float64
 13  x7th_week        269 non-null    float64
 14  x8th_week        260 non-null    float64
 15  x9th_week        253 non-null    float64
 16  x10th_week       244 non-null    float64
 17  x11th_week      

In [4]:
df.describe()

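| | year | x1st.week | x2nd.week | x3rd.week | x4th.week | x5th.week | x6th.week | x7th.week | x8th.week | x9th.week | ... | x67th.week | x68th.week | x69th.week | x70th.week | x71st.week | x72nd.week | x73rd.week | x74th.week | x75th.week | x76th.week |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 317.0 | 317.0 | 312.0 | 307.0 | 300.0 | 292.0 | 280.0 | 269.0 | 260.0 | 253.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| mean | 2000.0 | 79.958991 | 71.173077 | 65.045603 | 59.763333 | 56.339041 | 52.360714 | 49.219331 | 47.119231 | 46.343874 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| std | 0.0 | 14.686865 | 18.200443 | 20.752302 | 22.324619 | 23.780022 | 24.473273 | 25.654279 | 26.370782 | 27.136419 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| min | 2000.0 | 15.0 | 8.0 | 6.0 | 5.0 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 25% | 2000.0 | 74.0 | 63.0 | 53.0 | 44.75 | 38.75 | 33.75 | 30.0 | 27.0 | 26.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 50% | 2000.0 | 81.0 | 73.0 | 66.0 | 61.0 | 57.0 | 51.5 | 47.0 | 45.5 | 42.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 75% | 2000.0 | 91.0 | 84.0 | 79.0 | 76.0 | 73.25 | 72.25 | 67.0 | 67.0 | 67.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| max | 2000.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 99.0 | 100.0 | 99.0 | 100.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |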
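
The falling non-null counts across the weekly columns suggest that a missing value means the song was no longer on the chart that week. Under that assumption, a minimal sketch of counting weeks on the chart and reshaping the wide weekly columns into a long format might look like this. The toy values and the names `weeks_on_chart`, `long_df`, `week`, and `rank` are illustrative, not part of the original notebook.

```python
import pandas as pd

# Toy frame mimicking the wide weekly-rank layout; values are made up.
toy = pd.DataFrame({
    'track': ['Song A', 'Song B'],
    'x1st_week': [78, 15],
    'x2nd_week': [63.0, 8.0],
    'x3rd_week': [49.0, None],
})

week_cols = [c for c in toy.columns if c.endswith('_week')]

# Number of weeks with a recorded rank for each song.
toy['weeks_on_chart'] = toy[week_cols].notnull().sum(axis=1)

# Wide -> long: one row per (track, week), dropping weeks with no rank.
long_df = (
    toy.melt(id_vars=['track'], value_vars=week_cols,
             var_name='week', value_name='rank')
       .dropna(subset=['rank'])
)

# Turn labels such as 'x2nd_week' into the integer week number 2.
long_df['week'] = long_df['week'].str.extract(r'(\d+)', expand=False).astype(int)
```

In the long format, questions such as each song's best rank become a simple `groupby('track')['rank'].min()`.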


In [10]:
df.head()

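| | year | artist_inverted | track | time | genre | date_entered | date_peaked | x1st_week | x2nd_week | x3rd_week | ... | x67th_week | x68th_week | x69th_week | x70th_week | x71st_week | x72nd_week | x73rd_week | x74th_week | x75th_week | x76th_week |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2000 | Destiny's Child | Independent Women Part I | 3:38 | Rock | 2000-09-23 | 2000-11-18 | 78 | 63.0 | 49.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 2000 | Santana | Maria, Maria | 4:18 | Rock | 2000-02-12 | 2000-04-08 | 15 | 8.0 | 6.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 2000 | Savage Garden | I Knew I Loved You | 4:07 | Rock | 1999-10-23 | 2000-01-29 | 71 | 48.0 | 43.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 2000 | Madonna | Music | 3:45 | Rock | 2000-08-12 | 2000-09-16 | 41 | 23.0 | 18.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 2000 | Aguilera, Christina | Come On Over Baby (All I Want Is You) | 3:38 | Rock | 2000-08-05 | 2000-10-14 | 57 | 47.0 | 45.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |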


In [17]:
# Compute how long each song took to reach its peak position after entering the chart.
df['date_entered'] = pd.to_datetime(df['date_entered'])
df['date_peaked'] = pd.to_datetime(df['date_peaked'])
df['days_to_peak'] = df['date_peaked'] - df['date_entered']
df['days_to_peak']

0     56 days
1     56 days
2     98 days
3     35 days
4     70 days
        ...  
312    0 days
313    0 days
314    0 days
315    0 days
316    0 days
Name: days_to_peak, Length: 317, dtype: timedelta64[ns]
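
The result is a `timedelta64` column. For arithmetic or plotting, integer days are often easier to work with. A minimal sketch, using the entry and peak dates of the first three rows shown earlier (the `weeks_to_peak` column is an illustrative addition):

```python
import pandas as pd

# Entry and peak dates copied from the first three rows of the dataset.
toy = pd.DataFrame({
    'date_entered': ['2000-09-23', '2000-02-12', '1999-10-23'],
    'date_peaked': ['2000-11-18', '2000-04-08', '2000-01-29'],
})
toy['date_entered'] = pd.to_datetime(toy['date_entered'])
toy['date_peaked'] = pd.to_datetime(toy['date_peaked'])

# .dt.days converts each timedelta into a whole number of days.
toy['days_to_peak'] = (toy['date_peaked'] - toy['date_entered']).dt.days

# Chart positions are weekly, so whole weeks may be a more natural unit.
toy['weeks_to_peak'] = toy['days_to_peak'] // 7
```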

In [32]:
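# A notebook cell only displays its last expression,
# so the output below is the minimum, not the mean or the maximum.
df.days_to_peak.mean()
df.days_to_peak.max()
df.days_to_peak.min()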

Timedelta('0 days 00:00:00')
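
Because only the last expression in a cell is displayed, the mean and maximum are computed here but never shown. One way to see all three at once is to aggregate them together. A minimal sketch on illustrative integer day counts (the name `peak_stats` is hypothetical):

```python
import pandas as pd

# Illustrative day counts, matching the first three rows of days_to_peak.
days_to_peak = pd.Series([56, 56, 98], name='days_to_peak')

# agg with a list of function names returns one labelled value per statistic.
peak_stats = days_to_peak.agg(['mean', 'max', 'min'])
print(peak_stats)
```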

In [42]:
# Passing a Series to .mean() on a groupby raises UnsupportedFunctionCall.
# Instead, convert the column to integer days, group it by artist, then average.
df['days_to_peak'].dt.days.groupby(df['artist_inverted']).mean().plot(kind='bar');
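
A minimal sketch of the per-artist average, on made-up data: selecting the column before calling `.mean()` avoids the `UnsupportedFunctionCall` raised when a Series is passed to `.mean()`. The artist names and values are illustrative, and `mean_days` is a hypothetical variable name.

```python
import pandas as pd

# Toy data; artists and durations are made up for illustration.
toy = pd.DataFrame({
    'artist_inverted': ['Artist A', 'Artist A', 'Artist B'],
    'days_to_peak': pd.to_timedelta([56, 70, 35], unit='D'),
})

mean_days = (
    toy.assign(days=toy['days_to_peak'].dt.days)   # timedelta -> integer days
       .groupby('artist_inverted')['days']         # select the column first
       .mean()                                     # then aggregate
       .sort_values(ascending=False)
)
```

With several hundred artists a bar chart of every group is hard to read, so plotting only the top entries, for example `mean_days.head(20).plot(kind='bar')`, is likely to be more legible.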