# Exploratory Data Analysis: The Basics
An approach to EDA:  
![image of the data flow showing visualization as an exploratory and iterative process](http://benbestphd.com/images/r4ds_data-science.png)

#### The goal of EDA is to discover patterns in data. This is a fundamental stepping stone towards predictive modelling, or an end goal in itself. 

Tips for good EDA:
- Get to know the context of the data.  
- Question the data: Who collected it? Who is distributing it? Do all of the patterns make sense given what you know about the world? If they don’t, go back and look more closely at your data.

- Use EDA to formulate a question based on the patterns that you see.
- Use EDA to check if a hypothesis is worth a deeper analysis.

- Keep the questions SIMPLE and BRIEF: the goal is to understand first and build complexity later.
- It's an iterative process: it's okay to repeat steps as long as you learn from previous output.

In [None]:
# importing the libraries for data processing
import numpy as np 
import pandas as pd 


### 1. Tidying our charts data

Read the CSV file, check for missing, duplicated and unexpected values, and filter if needed.

In [None]:
# read the charts dataset
charts_df = pd.read_csv('data/spotify_daily_charts.csv')
charts_df.head()

### Data Checks
It is prudent to run the following checks on a DataFrame before any analysis:
1. Check shape
2. Check data types of columns
3. Check null values in columns
4. Check rows with null values
5. Check for duplicates
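
The five checks above can be bundled into a single helper. This is a minimal sketch on made-up data: `toy_df` and the helper name `basic_checks` are illustrative assumptions, not part of the charts dataset or of pandas.

```python
import pandas as pd

def basic_checks(df):
    """Collect the five basic checks in one dictionary (hypothetical helper)."""
    return {
        'shape': df.shape,                                       # 1. (rows, columns)
        'dtypes': df.dtypes.astype(str).to_dict(),               # 2. data type per column
        'null_counts': df.isnull().sum().to_dict(),              # 3. nulls per column
        'rows_with_nulls': int(df.isnull().any(axis=1).sum()),   # 4. rows containing any null
        'duplicate_rows': int(df.duplicated().sum()),            # 5. fully duplicated rows
    }

# made-up data for illustration only
toy_df = pd.DataFrame({
    'position': [1, 2, 2, 3],
    'artist': ['A', 'B', 'B', None],
})
report = basic_checks(toy_df)
report
```

The cells that follow run the same checks one at a time on `charts_df`, which is easier to read while exploring.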

In [None]:
#Check the shape of the dataframe
charts_df.shape 

In [None]:
#list comprehension to find dates in the expected range that are missing from the dataset
complete_dates = pd.date_range(start='2017-01-01', end='2021-05-20', freq='D').strftime('%Y-%m-%d')
dataset_dates = pd.unique(charts_df['date'])

[p for p in complete_dates if p not in dataset_dates]
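
The same check can also be written as a set difference between two indexes. A small sketch on made-up dates (`toy_dates` is an assumption for illustration, with one day deliberately left out), using `Index.difference`:

```python
import pandas as pd

# made-up chart dates; 2021-01-03 is deliberately missing, 2021-01-04 repeats
toy_dates = pd.Series(['2021-01-01', '2021-01-02', '2021-01-04', '2021-01-04'])

# every calendar day we expect to see, formatted like the dataset's date strings
expected = pd.date_range(start='2021-01-01', end='2021-01-04', freq='D').strftime('%Y-%m-%d')

# values in `expected` that never appear in the data
missing = expected.difference(pd.Index(toy_dates))
missing.tolist()
```

An empty result means every expected date appears at least once.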

In [None]:
#expected number of rows if every date has all 200 chart positions
200*len(charts_df['date'].unique())

In [None]:
#Check data types of the columns
charts_df.dtypes

In [None]:
#Check null values in the columns
charts_df.info()

In [None]:
charts_df[charts_df['artist'].isnull()]

In [None]:
#Check for duplicates
sum(charts_df.duplicated())

In [None]:
#check if unique values are expected
charts_df['position'].unique()

In [None]:
len(charts_df['artist'].unique())

In [None]:
len(charts_df['track_name'].unique())

In [None]:
len(charts_df['track_id'].unique())

> Q: Why do we have N track ids but only M track names?
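
One way to investigate this question is to count how many distinct ids share each name. The sketch below uses made-up rows (`toy` is illustrative); whether the same pattern explains the charts data is something to verify on `charts_df` itself.

```python
import pandas as pd

# made-up data: 'Song A' appears under two different ids
toy = pd.DataFrame({
    'track_id': ['id1', 'id2', 'id3', 'id3'],
    'track_name': ['Song A', 'Song A', 'Song B', 'Song B'],
})

# number of distinct ids behind each track name
ids_per_name = toy.groupby('track_name')['track_id'].nunique()

# keep only the names that map to more than one id
shared_names = ids_per_name[ids_per_name > 1]
shared_names
```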

##### Convert date to datetime index
Pandas has a very useful function, `pd.to_datetime`, that parses date and time strings into datetime values. Once a column is a datetime, the `.dt` accessor makes time series techniques much easier.

In [None]:
#transform date column into a datetime column
charts_df['date'] = pd.to_datetime(charts_df['date'])
charts_df.head()

In [None]:
#extract month 
charts_df['month']=charts_df['date'].dt.month
charts_df.head()

In [None]:
#extract year
charts_df['year']=charts_df['date'].dt.year
# get day and day of week
charts_df['day']=charts_df['date'].dt.day
charts_df['day_of_week']=charts_df['date'].dt.dayofweek # The day of the week with Monday=0, Sunday=6.
charts_df.head()
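
Other `.dt` attributes and methods follow the same pattern. A short sketch on two made-up dates (`toy` is illustrative), assuming the strings are in `YYYY-MM-DD` format:

```python
import pandas as pd

# made-up dates for illustration
toy = pd.DataFrame({'date': ['2021-05-17', '2021-05-23']})

# passing an explicit format avoids ambiguous parsing
toy['date'] = pd.to_datetime(toy['date'], format='%Y-%m-%d')

# readable weekday name, and a weekend flag built from dayofweek (Monday=0, Sunday=6)
toy['day_name'] = toy['date'].dt.day_name()
toy['is_weekend'] = toy['date'].dt.dayofweek >= 5
toy
```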

### 2. Examining the charts data
Reshape and aggregate the DataFrame to answer basic data questions 

In [None]:
#Let's create tallies of each column using the `value_counts` method
charts_df['artist'].value_counts()[:20]

In [None]:
charts_df['track_name'].value_counts()

In [None]:
#filtering rows by a column value
charts_df[charts_df['track_name']=='Happier']

> Q1. From the top 50 most streamed, get the top 20 most frequently occurring artists

In [None]:
charts_df[charts_df['position']<=50]['artist'].value_counts()[:20]

> Q2. From the top 50 list for 2020, get the top 20 most frequently occurring artists

In [None]:
charts_df[(charts_df['position']<=50)&(charts_df['year']==2020)]['artist'].value_counts()[:20]

> Q3. At what positions did Taylor Swift land on the chart in 2019? Which of her songs reached the first position?

In [None]:
np.sort(charts_df[(charts_df['artist']=='Taylor Swift')&(charts_df['year']==2019)]['position'].unique())

In [None]:
charts_df[(charts_df['artist']=='Taylor Swift')&\
                    (charts_df['year']==2019)&\
                    (charts_df['position']==1)]['track_name'].unique()
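
Long filters like the ones above can be easier to read when each condition gets its own named mask. A sketch on made-up rows (`toy` and the mask names are illustrative). When conditions are written inline, each comparison needs its own parentheses because `&` binds more tightly than `==` or `<=`.

```python
import pandas as pd

# made-up chart rows for illustration
toy = pd.DataFrame({
    'artist': ['A', 'A', 'B', 'B', 'A'],
    'year': [2019, 2019, 2019, 2019, 2020],
    'position': [1, 3, 1, 60, 2],
})

# one boolean Series per condition
is_2019 = toy['year'] == 2019
is_top50 = toy['position'] <= 50

# combine with & (and), | (or), ~ (not), then select rows and a column with .loc
counts = toy.loc[is_2019 & is_top50, 'artist'].value_counts()
counts
```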

### 3. Quick stats and Aggregating the charts dataset


**Quick stats**


Basic stats can be computed using the `describe` method

In [None]:
charts_df['streams'].describe()

**Aggregation**

The pandas `groupby` method works much like a pivot table in Excel.

The syntax for a single index column and single agg column:
```python
df.groupby('index_col')['agg_col'].aggfunc()
```

A good analogy for pandas `groupby` is making cocktails at a party: the glasses are the items in `index_col`, the beverage is the `agg_col`, and how the beverage is poured into the glasses is `aggfunc`.
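
To make the analogy concrete, here is a sketch on made-up numbers (`toy` is illustrative): `year` plays the glasses, `streams` the beverage, and `sum` or `max` the way of pouring.

```python
import pandas as pd

# made-up data for illustration
toy = pd.DataFrame({
    'year': [2019, 2019, 2020, 2020],
    'streams': [100, 200, 300, 50],
})

# same glasses and beverage, two different ways of pouring
total_per_year = toy.groupby('year')['streams'].sum()
max_per_year = toy.groupby('year')['streams'].max()
total_per_year
```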

<img src="groupby.png" align="left" alt="Drawing" style="width: 300px;"/>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>


For multiple indices and multiple aggregations:
```python
df.groupby(['index_col','index_col2']).agg({'agg_col1': aggfunc1, 'agg_col2': aggfunc2})
```
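
A sketch of the multi-index, multi-aggregation form on made-up rows (`toy` is illustrative). The dictionary passed to `agg` maps each column to the function applied to it:

```python
import pandas as pd

# made-up data for illustration
toy = pd.DataFrame({
    'year': [2019, 2019, 2020, 2020],
    'artist': ['A', 'A', 'A', 'B'],
    'streams': [100, 200, 300, 50],
    'position': [3, 1, 2, 10],
})

# total streams and best (lowest) position per year and artist
summary = toy.groupby(['year', 'artist']).agg({'streams': 'sum', 'position': 'min'})
summary
```

The result has a two-level index of `year` and `artist`; calling `reset_index()` turns those levels back into ordinary columns.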

> Q: How many total streams did charting songs on Spotify earn per year?

In [None]:
charts_df.groupby('year')['streams'].sum()   #inputting a column name string in agg_column outputs a Series

In [None]:
charts_df.groupby('year')[['streams']].sum()   #inputting a list in agg_column outputs a DataFrame

> Q: How many streams did each of the 200 positions contribute to the annual streams of Spotify?

In [None]:
charts_df.groupby(['year','position'])[['streams']].sum()   #inputting a list in agg_column outputs a DataFrame

> Q: What visualization would best suit the output of the cell above?

### 4. Combining two datasets

- What insights could we get from merging the charts and tracks datasets?

In [None]:
# read the tracks dataset
tracks_df = pd.read_csv('data/spotify_daily_charts_tracks.csv')
tracks_df.head()

**Combining dataframes**

The pandas `merge` method combines DataFrames/Series in the same way common database languages (e.g. SQL) perform table joins.

The most basic syntax for a single key column:
```python
df1.merge(df2, on='key_column', how=<'join_type'>)
```
where `<'join_type'>` is commonly one of `'left'`, `'right'`, `'inner'`, or `'outer'`

Here is a diagram that illustrates what each join type produces
<br>
<br>
<img src="merge.png" align="left" alt="Drawing" style="width: 500px;"/>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
For multiple key columns:
```python
df1.merge(df2, on=['key_column1','key_column2'], how=<'join_type'>)
```
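
To see how the join type changes the result, the sketch below merges two made-up tables (`left` and `right` are illustrative) with each option and counts the resulting rows:

```python
import pandas as pd

# made-up tables: t2 and t3 appear in both, t1 only on the left, t4 only on the right
left = pd.DataFrame({'track_id': ['t1', 't2', 't3'], 'streams': [10, 20, 30]})
right = pd.DataFrame({'track_id': ['t2', 't3', 't4'], 'tempo': [120.0, 90.0, 100.0]})

# number of rows produced by each join type
row_counts = {how: len(left.merge(right, on='track_id', how=how))
              for how in ['left', 'right', 'inner', 'outer']}
row_counts
```

Keys without a match on the other side are kept by `'left'`, `'right'`, or `'outer'` joins, with the missing columns filled with `NaN`.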

In [None]:
#merge charts dataframe with tracks dataframe
#follow charts_df's rows

df = charts_df.merge(tracks_df, on='track_id', how='left')
df.head()

In [None]:
#Always check number of rows when performing merges
charts_df.shape, tracks_df.shape, df.shape
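
Beyond comparing shapes, `merge` itself can help with this check. The sketch below uses made-up tables (`charts` and `tracks` are illustrative stand-ins): `validate='many_to_one'` raises an error if the right table has duplicate keys, and `indicator=True` adds a `_merge` column showing where each row came from.

```python
import pandas as pd

# made-up tables: t9 charted but has no entry in the tracks table
charts = pd.DataFrame({'track_id': ['t1', 't1', 't2', 't9'], 'streams': [10, 20, 30, 40]})
tracks = pd.DataFrame({'track_id': ['t1', 't2'], 'tempo': [120.0, 90.0]})

merged = charts.merge(tracks, on='track_id', how='left',
                      validate='many_to_one', indicator=True)

# a left join on unique right-hand keys should keep the left row count
same_row_count = len(merged) == len(charts)

# rows from the left table that found no match on the right
unmatched = merged[merged['_merge'] == 'left_only']
unmatched
```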

In [None]:
df.columns

In [None]:
#drop duplicated track_name column
df = df.drop(columns='track_name_y')
#rename track_name_x back to track_name
df = df.rename(columns={'track_name_x':'track_name'})
df.head()

In [None]:
#check if expected columns are present
df.columns

## Q&A

Q1: What are the top 10 songs in terms of total streams from 2017 to 2020?

In [None]:
# groupby tracks and sum streams, sort and get first 10 rows 
df.groupby(['track_id','track_name'])['streams'].sum().sort_values(ascending=False)[:10]

Q2: What's the mean tempo of the top 10 most streamed songs?

In [None]:
top10songs = df.groupby(['track_id','track_name'])['streams'].sum()\
            .sort_values(ascending=False)[:10]\
            .reset_index()['track_id'].values
top10songs

In [None]:
#isin selects elements in list
df[df['track_id'].isin(top10songs)]['tempo'].mean() #in bpm

Q2a. Follow-up: How does this compare with the mean tempo of the rest of the songs?

In [None]:
#use ~ to negate
df[~df['track_id'].isin(top10songs)]['tempo'].mean() #in bpm
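
Note that `df` holds one row per chart entry, so the two means above weight each song by the number of days it charted. If a per-song average is wanted instead, one option is to drop duplicate `track_id` rows first. A sketch on made-up numbers (`toy` is illustrative), assuming tempo is constant for a given `track_id`:

```python
import pandas as pd

# made-up data: t1 charted on three days, t2 on one
toy = pd.DataFrame({
    'track_id': ['t1', 't1', 't1', 't2'],
    'tempo': [100.0, 100.0, 100.0, 140.0],
})

# every chart row counts once, so t1 dominates
row_weighted_mean = toy['tempo'].mean()

# every song counts once
per_song_mean = toy.drop_duplicates('track_id')['tempo'].mean()
row_weighted_mean, per_song_mean
```

Which version is appropriate depends on the question being asked: the first describes what listeners saw on the charts, the second describes the songs themselves.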

**Self Check**

Q: Which song had the most days within top 5 of the charts for 2020?

Q: Which artist had the most days within top 10 of the charts for 2020?

Q: What are the top 5 “saddest” charting songs for 2020?

### Plain tables as output?
1. Tables are simple, fast answers to simple, fast questions.
2. Tables are very useful for troubleshooting. The numbers often reveal whether something went wrong with the data source or processing.
3. In most office setups, analytics outputs are often consumed by another team (e.g. market segments group -> finance for sales projections). Because tables can be plugged directly into their computations, these teams usually prefer tables over deployed products.