# Introduction to Exploratory Data Analysis (EDA) using `pandas`

The data for this exercise comes from [Kaggle](https://www.kaggle.com/leonardopena/top50spotify2019).

### Read data

In [None]:
import pandas as pd

df = pd.read_csv(r"../data/top50.csv", encoding='ISO-8859-2')

_`pandas.read_csv()` documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html_

_Python encoding: https://docs.python.org/3/library/codecs.html#standard-encodings_
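
The `encoding` argument tells `read_csv()` how to turn the bytes of the file into text. As a small illustration, the sketch below builds a hypothetical two-line CSV in memory (the column names and the row are made up for the example and are not taken from `top50.csv`), encodes it as ISO-8859-2, and reads it back with the matching encoding.

```python
import io

import pandas as pd

# Hypothetical CSV content with a non-ASCII character (assumption: not the real file).
text = "Track.Name,Artist.Name\nCafé del Mar,Some Artist\n"
raw = text.encode("ISO-8859-2")

# Decoding with the same encoding that produced the bytes restores the text.
toy = pd.read_csv(io.BytesIO(raw), encoding="ISO-8859-2")
print(toy)
```

If the encoding passed to `read_csv()` does not match the one used to write the file, accented characters may come out wrong or the read may fail.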

View data.

In [None]:
df.head()

Display some basic information about this data frame.

In [None]:
df.info()

You can also use `df.dtypes` to view data types for each column.

View column names.

In [None]:
df.columns

### Drop a column

In [None]:
df.columns[0]

In [None]:
df = df.drop(df.columns[0], axis=1)

df.head()
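
Note that `drop()` returns a new data frame rather than modifying the original, which is why the result is assigned back to `df`. A minimal sketch on a hypothetical frame (the `row_id` column is an assumption made for illustration) shows dropping by position and by name:

```python
import pandas as pd

# Hypothetical frame whose first column is a leftover row counter.
toy = pd.DataFrame({"row_id": [1, 2], "TrackName": ["a", "b"], "Energy": [55, 81]})

# drop() returns a new frame; `toy` itself is left unchanged.
by_position = toy.drop(toy.columns[0], axis=1)
by_name = toy.drop(columns=["row_id"])

print(list(toy.columns))
print(list(by_position.columns))
```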

### Modify column names

View column names.

In [None]:
df.columns

Access the string methods of the column index through its `.str` accessor.

In [None]:
type(df.columns.str)

Example: Convert all column names into uppercase.

In [None]:
df.columns.str.upper()

Remove dots from column names.

In [None]:
# regex=False treats '.' as a literal dot rather than a regex wildcard
df.columns.str.replace('.', "", regex=False)

Save the results into the data frame.

In [None]:
df.columns = df.columns.str.replace('.', "", regex=False)

In [None]:
df.head()
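
In a regular expression, `.` matches any character, so it matters whether the pattern is treated literally. The sketch below, using two hypothetical column names, compares both interpretations:

```python
import pandas as pd

# Hypothetical column names containing dots.
cols = pd.Index(["Track.Name", "Beats.Per.Minute"])

# Literal replacement: only the dots are removed.
literal = cols.str.replace(".", "", regex=False)

# Regex replacement: '.' matches every character, so nothing is left.
as_regex = cols.str.replace(".", "", regex=True)

print(list(literal))
print(list(as_regex))
```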

### Correlations

In [None]:
# correlations are only defined for numeric columns
corr_matrix = df.select_dtypes(include='number').corr()

corr_matrix
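
By default, `corr()` computes the Pearson correlation coefficient for each pair of columns. As a rough illustration, the toy frame below (made-up values, not the Spotify data) contains one perfectly increasing and one perfectly decreasing relationship, and the result is cross-checked against NumPy:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],   # rises with x
    "z": [5, 4, 3, 2, 1],    # falls as x rises
    "label": list("abcde"),  # non-numeric, excluded below
})

corr = toy.select_dtypes(include="number").corr()
print(corr)

# The same coefficient computed directly with NumPy.
r_xy = np.corrcoef(toy["x"], toy["y"])[0, 1]
```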

Plot the correlation matrix.

In [None]:
# import necessary packages
import matplotlib.pyplot as plt
import seaborn as sns

# if needed, turn on interactive mode
plt.ion()

_The `ion()` function turns on matplotlib's interactive mode, so figures are drawn as soon as they are created or updated. In frontends like the Jupyter notebook, plots are displayed inline, directly below the code cell that produced them, when the inline backend is active (it can be requested explicitly with the `%matplotlib inline` magic). The resulting plots will then also be stored in the notebook document. More details here: https://ipython.readthedocs.io/en/stable/interactive/plotting.html_

In [None]:
# create a heat map
sns.heatmap(corr_matrix);

Increase the figure size.

In [None]:
# set figure size
plt.figure(figsize=(12, 9))

# create a heat map
sns.heatmap(corr_matrix);

Change the color palette.

In [None]:
# define a color palette
cmap = sns.diverging_palette(10, 220, n=20)

plt.figure(figsize=(12, 9))

# keep the returned Axes so it can be adjusted below
ax = sns.heatmap(corr_matrix, cmap=cmap)

# if needed, adjust the truncated top and bottom rows
#ax.set_ylim(len(corr_matrix), 0)

_Note: Depending on the version of `matplotlib` installed on your computer, the top and bottom rows of the heat map may appear truncated. This is a known bug; see https://github.com/matplotlib/matplotlib/issues/14751_

### Distributions

View summary statistics.

In [None]:
df.describe()

_`pandas.DataFrame.describe` documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html_

Transpose the summary statistics for better visibility.

In [None]:
df.describe().T

Export the summary statistics.

In [None]:
df.describe().T.to_csv(r'../output/descr_top50.csv')

### Histograms

Plot a histogram for one variable.

In [None]:
df['Energy'].hist();

_`pandas.DataFrame.hist` documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.hist.html_

Plot histograms for multiple variables.

In [None]:
df.hist();

Change the number of bins.

In [None]:
df.hist(bins=100);

#### Format plots

Define 10 subplots and arrange them into 5 rows and 2 columns.

In [None]:
fig, axes = plt.subplots(nrows=5, ncols=2)

View the axes component.

In [None]:
axes

Iterate through the axes. Because `axes` is a 2-D array, each iteration yields one row of two axes.

In [None]:
for ax in axes:
    print(ax)
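
The difference between iterating over the array itself and over its `.flat` attribute can be sketched with plain NumPy. Here a 5x2 array of integers stands in for the array of axes (an assumption made so the example runs without creating a figure):

```python
import numpy as np

# Stand-in for the 5x2 `axes` array: integers instead of Axes objects.
grid = np.arange(10).reshape(5, 2)

rows = [row.tolist() for row in grid]  # iterating the array yields whole rows
cells = list(grid.flat)                # .flat yields one element at a time

print(rows[0])
print(len(rows), len(cells))
```

This is why looping over `axes.flat` is convenient when each subplot should receive exactly one column.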

Iterate through the column names in the data frame.

In [None]:
for i, col in enumerate(df.columns):
    print(i, col)

Isolate numeric columns.

In [None]:
num_cols = df.columns[3:]
num_cols

A better way to isolate numeric columns is by using `df.dtypes`.

In [None]:
df.columns[df.dtypes=='int64']
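
Comparing against `'int64'` only finds integer columns. If a frame also holds floats, `select_dtypes()` may be a more general option. A minimal sketch on a hypothetical frame (the values and the float column are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "TrackName": ["a", "b"],
    "Energy": [55, 81],        # int64
    "Valence": [0.75, 0.61],   # float64, missed by an == 'int64' test
})

int_only = toy.columns[toy.dtypes == "int64"]
numeric = toy.select_dtypes(include="number").columns

print(list(int_only))
print(list(numeric))
```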

Check how many numeric columns we have.

In [None]:
print(len(num_cols))

Plot histograms for all numeric columns.

In [None]:
fig, axes = plt.subplots(nrows=5, ncols=2)

for i, ax in enumerate(axes.flat):
    col = num_cols[i]
    print(col)
    df[col].hist(bins=100, ax=ax);

Increase the subplot figure sizes.

In [None]:
fig, axes = plt.subplots(nrows=5, ncols=2, figsize=(12, 9))

for i, ax in enumerate(axes.flat):
    col = num_cols[i]
    df[col].hist(bins=100, ax=ax)
    ax.set_xlabel(col, weight='bold', size=10)

Improve the plot layout.

In [None]:
fig, axes = plt.subplots(nrows=5, ncols=2, figsize=(12, 9))

for i, ax in enumerate(axes.flat):
    col = num_cols[i]
    df[col].hist(bins=100, ax=ax)
    ax.set_xlabel(col, weight='bold', size=10)

# adjust the spacing after the labels are set so they are taken into account
fig.tight_layout()

Remove the gridlines.

In [None]:
fig, axes = plt.subplots(nrows=5, ncols=2, figsize=(12, 9))

for i, ax in enumerate(axes.flat):
    col = num_cols[i]
    df[col].hist(bins=100, ax=ax, grid=False)
    ax.set_xlabel(col, weight='bold', size=10)

# adjust the spacing after the labels are set so they are taken into account
fig.tight_layout()

Use a color palette.

In [None]:
current_palette = sns.color_palette()
sns.palplot(current_palette)

_Seaborn documentation on 'Choosing color palettes': https://seaborn.pydata.org/tutorial/color_palettes.html_

Apply this color palette to the histograms.

In [None]:
fig, axes = plt.subplots(nrows=5, ncols=2, figsize=(12, 9))

for i, ax in enumerate(axes.flat):
    col = num_cols[i]
    df[col].hist(bins=100, ax=ax, grid=False, color=sns.color_palette()[i])
    ax.set_xlabel(col, weight='bold', size=10)

# adjust the spacing after the labels are set so they are taken into account
fig.tight_layout()

Save the histograms to a PDF file.

In [None]:
fig, axes = plt.subplots(nrows=5, ncols=2, figsize=(12, 9))

for i, ax in enumerate(axes.flat):
    col = num_cols[i]
    df[col].hist(bins=100, ax=ax, grid=False, color=sns.color_palette()[i])
    ax.set_xlabel(col, weight='bold', size=10)

# adjust the spacing after the labels are set, then save
fig.tight_layout()
plt.savefig(r'../output/histograms_top50.pdf')

### Violin/Box plots

The default violin plot.

In [None]:
sns.violinplot(x='Genre', y='Energy', data=df);

Check the distribution of `Genre`.

In [None]:
df.Genre.value_counts()

Combine similar genres to reduce cardinality.

In [None]:
import numpy as np

df.loc[:, 'Genre2'] = np.where(df.Genre.str.contains('pop'), 'POP',
                        np.where(df.Genre.str.contains('hip hop'), 'HIP HOP',
                                np.where(df.Genre.str.contains('rap'), 'RAP', 'OTHER')))

df.Genre2.value_counts()
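
Nested `np.where()` calls become hard to read as the number of categories grows. One possible alternative is `np.select()`, which takes a list of conditions and a matching list of labels and, for each row, uses the first condition that is true. The sketch below applies it to a few hypothetical genre strings (assumptions for illustration, not the full dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical genre labels.
genres = pd.Series(["dance pop", "canadian hip hop", "dfw rap", "latin", "pop"])

# Conditions are checked in order; the first match decides the label.
conditions = [
    genres.str.contains("pop").to_numpy(),
    genres.str.contains("hip hop").to_numpy(),
    genres.str.contains("rap").to_numpy(),
]
choices = ["POP", "HIP HOP", "RAP"]

grouped = np.select(conditions, choices, default="OTHER")
print(list(grouped))
```

As with the nested version, the order of the conditions matters: a genre containing both "pop" and "rap" would be labelled `POP`.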

View data to ensure the mapping was done correctly.

In [None]:
df[['Genre', 'Genre2']].sort_values(by='Genre2')

Plot the new genre column created above.

In [None]:
sns.violinplot(x='Genre2', y='Energy', data=df)

plt.show();

Increase the figure size.

In [None]:
plt.figure(figsize=[12, 9])

sns.violinplot(x='Genre2', y='Energy', data=df);

Box plot.

In [None]:
plt.figure(figsize=[12, 9])

sns.boxplot(x='Genre2', y='Energy', data=df);

Add axes labels.

In [None]:
plt.figure(figsize=[12, 9])

sns.violinplot(x='Genre2', y='Energy', data=df)

plt.xlabel('Genre', fontsize=14, weight='bold')
plt.ylabel('Energy', fontsize=14, weight='bold');

Change plot layout to add gridlines.

In [None]:
plt.figure(figsize=[12, 9])
sns.set_style('darkgrid')

sns.violinplot(x='Genre2', y='Energy', data=df)

plt.xlabel('Genre', fontsize=14, weight='bold')
plt.ylabel('Energy', fontsize=14, weight='bold');

### Scatter plot

Plot loudness versus energy.

In [None]:
plt.figure(figsize=[12, 9])

ax = sns.scatterplot(x='LoudnessdB', y='Energy', data=df, color='tomato')

plt.xlabel('Loudness', fontsize=14, weight='bold')
plt.ylabel('Energy', fontsize=14, weight='bold');

Increase the size of each dot.

In [None]:
plt.figure(figsize=[12, 9])

ax = sns.scatterplot(x='LoudnessdB', y='Energy', data=df, color='tomato', s=100)

plt.xlabel('Loudness', fontsize=14, weight='bold')
plt.ylabel('Energy', fontsize=14, weight='bold');

Label data points based on genre.

In [None]:
plt.figure(figsize=[12, 9])

ax = sns.scatterplot(x='LoudnessdB', y='Energy', data=df, color='tomato', s=100, hue='Genre2')

plt.xlabel('Loudness', fontsize=14, weight='bold')
plt.ylabel('Energy', fontsize=14, weight='bold')
plt.legend(loc='lower right', fontsize=12);

### Data Aggregation

How many unique artists are there in this dataset?

In [None]:
df.ArtistName.nunique()

Frequency of occurrence for each artist.

In [None]:
df.ArtistName.value_counts()

Let's analyze the average popularity by genre. Start with the number of tracks in each genre.

In [None]:
df.groupby('Genre2').size()

You can also calculate the % distribution by dividing each count by the total number of records in the data frame.

In [None]:
df.groupby('Genre2').size() / len(df)
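
The same proportions can also be obtained with `value_counts(normalize=True)`. A minimal sketch on a hypothetical four-row frame (the labels are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"Genre2": ["POP", "POP", "RAP", "OTHER"]})

# Share of rows per category, computed two ways.
share = toy["Genre2"].value_counts(normalize=True)
by_division = toy.groupby("Genre2").size() / len(toy)

print(share.sort_index())
print(by_division)
```

`value_counts()` orders the result by frequency, while `groupby().size()` orders it by category label, so the sketch sorts by index before comparing.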

Use the `pandas` `groupby()` method to create a grouped (`DataFrameGroupBy`) object.

In [None]:
df.groupby('Genre2')

View the size of each category of `Genre2`.

In [None]:
df.groupby('Genre2').size()

Select a column from the grouped object to get a grouped series (`SeriesGroupBy`).

In [None]:
df.groupby('Genre2')['Popularity']

Now you can run aggregation functions, such as `mean()` or `max()`, on the grouped series.

In [None]:
df.groupby('Genre2')['Popularity'].mean()
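
Several aggregations can be computed in one pass with `agg()`. The sketch below uses a hypothetical four-row frame (made-up values) and returns one column per aggregation:

```python
import pandas as pd

toy = pd.DataFrame({
    "Genre2": ["POP", "POP", "RAP", "OTHER"],
    "Popularity": [90, 80, 85, 70],
})

# One row per genre, one column per aggregation.
summary = toy.groupby("Genre2")["Popularity"].agg(["count", "mean", "max"])
print(summary)
```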

Save the results.

In [None]:
genre_size = df.groupby('Genre2').size()

genre_popularity = df.groupby('Genre2')['Popularity'].mean()

Create a two-axis plot to view the genre size and average popularity.

In [None]:
sns.set(style = 'dark')
f, ax = plt.subplots(figsize = (12, 9))

ax2 = ax.twinx()

ax.bar(genre_size.index, genre_size, color='teal', alpha=0.5)

ax2.plot(genre_popularity.index, genre_popularity, color='orangered', lw=3)

ax.set_xlabel('Genre', fontsize=14, weight='bold')
ax.set_ylabel('Count', fontsize=14, weight='bold', color='teal')
ax2.set_ylabel('Average Popularity', fontsize = 14, color='orangered', weight='bold');

Sort the genres from largest to smallest.

In [None]:
genre_size.sort_values(ascending=False)

In [None]:
genre_size = genre_size.sort_values(ascending=False)

genre_popularity = genre_popularity.sort_values(ascending=False)

### Join (combine) data

In [None]:
print(type(genre_size), type(genre_popularity))

Before we can join these two `pd.Series`, we must convert one of them into a `pd.DataFrame` object.

In [None]:
genre_size.to_frame()

In [None]:
genre_size.to_frame().join(genre_popularity)

See: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html
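
`join()` matches rows by index label, not by position, so the two series do not need to be in the same order. A small sketch with hypothetical counts and averages (made-up values) illustrates this. It also names the column with `to_frame(name=...)`, which is one possible way to avoid a column called `0`:

```python
import pandas as pd

# Hypothetical per-genre results, deliberately in different orders.
size = pd.Series({"POP": 3, "RAP": 1, "OTHER": 2})
popularity = pd.Series({"OTHER": 70.0, "POP": 88.0, "RAP": 85.0}, name="Popularity")

# The left frame keeps its row order; values from the right are aligned by label.
joined = size.to_frame(name="Count").join(popularity)
print(joined)
```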

Reset the index so that `Genre2` becomes a regular column.

In [None]:
genre_size.to_frame().join(genre_popularity).reset_index()

Let's save the results into a data frame.

In [None]:
genre_df = genre_size.to_frame().join(genre_popularity).reset_index()

genre_df.columns

Rename a column in this data frame.

In [None]:
genre_df.rename(columns={0: 'Count'})

In [None]:
genre_df = genre_df.rename(columns={0: 'Count'})

Plot the results with the genres sorted from largest to smallest.

In [None]:
sns.set(style = 'dark')
f, ax = plt.subplots(figsize = (12, 9))

ax2 = ax.twinx()

ax.bar(genre_df['Genre2'], genre_df['Count'], color='teal', alpha=0.3)

ax2.plot(genre_df['Genre2'], genre_df['Popularity'], color='orangered', marker='o', markersize=10)

ax.set_xlabel('Genre', fontsize=14, weight='bold')
ax.set_ylabel('Count', fontsize=14, weight='bold', color='teal')
ax2.set_ylabel('Average Popularity', fontsize = 14, color='orangered', weight='bold');

### Plotting pairwise data relationships using `seaborn`

In [None]:
g = sns.PairGrid(df)

g.map(plt.scatter);