# Introduction to Exploratory Data Analysis (EDA) using `pandas` (Tutorial)

This notebook is the tutorial version of `03_intro_to_pandas_eda.ipynb`, which contains the completed code. The data for this exercise comes from [Kaggle](https://www.kaggle.com/leonardopena/top50spotify2019).

### 1. Read data

In [None]:
import pandas as pd

df = pd.read_csv(r"../data/top50.csv", encoding='ISO-8859-2')

`pandas.read_csv()` documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

_Python encoding: https://docs.python.org/3/library/codecs.html#standard-encodings_

In [None]:
# view data


Display some basic information about this data frame.

You can also use `df.dtypes` to view data types for each column.
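
As a minimal sketch of these inspection steps, the snippet below uses a small toy data frame in place of the CSV file. The column names and values are illustrative assumptions, not the actual contents of `top50.csv`.

```python
import pandas as pd

# Toy stand-in for the Spotify data; names and values are assumptions.
df = pd.DataFrame({
    "Track.Name": ["Song A", "Song B", "Song C"],
    "Genre": ["pop", "dance pop", "latin"],
    "Energy": [55, 81, 80],
    "Popularity": [79, 92, 85],
})

print(df.head(2))   # first two rows
df.info()           # index range, non-null counts, dtypes, memory usage
print(df.dtypes)    # data type of each column
print(df.shape)     # (number of rows, number of columns)
```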

View column names.

### Drop a column
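
A possible sketch of dropping a column, assuming a toy data frame with a leftover index column called `Unnamed: 0` (a hypothetical name chosen for illustration; the column to drop in your data may differ):

```python
import pandas as pd

# Toy data; "Unnamed: 0" stands in for a redundant index column.
df = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3],
    "Track.Name": ["Song A", "Song B", "Song C"],
    "Energy": [55, 81, 80],
})

print(df.columns.tolist())  # column names before dropping

# drop() returns a new data frame, so assign the result back
df = df.drop(columns=["Unnamed: 0"])
print(df.columns.tolist())  # column names after dropping
```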

### Modify column names

View column names.

Convert the column-names array into a string object.

Example: Convert all column names into uppercase.

Remove dots from column names.

Save the results into the data frame.
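
The renaming steps above could look roughly like this. The column names are assumptions made for the sketch; the `.str` accessor is what exposes string methods on the column index.

```python
import pandas as pd

# Toy data frame with dotted column names (illustrative only).
df = pd.DataFrame({
    "Track.Name": ["Song A"],
    "Beats.Per.Minute": [117],
    "Loudness..dB..": [-6],
})

print(df.columns)  # an Index holding the column labels

# .str gives access to vectorized string methods on the labels
upper = df.columns.str.upper()

# regex=False treats "." as a literal dot rather than "any character"
cleaned = df.columns.str.replace(".", "", regex=False)

# assign the cleaned labels back to the data frame
df.columns = cleaned
print(df.columns.tolist())
```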

### Correlations

Plot the correlation matrix.

In [None]:
# import necessary packages
import matplotlib.pyplot as plt
import seaborn as sns

# if needed, enable inline plots
#plt.ion()

_The `plt.ion()` function turns on matplotlib's interactive mode, in which figures are drawn and updated as soon as plotting commands run. In Jupyter, displaying plots inline (directly below the code cell that produced them, and stored in the notebook document) is controlled by the `%matplotlib inline` magic. More details here: https://ipython.readthedocs.io/en/stable/interactive/plotting.html_

In [None]:
# create a heat map


Increase the figure size.

In [None]:
# set figure size

# create a heat map


Change the color palette.

In [None]:
# define a color palette

# if needed, adjust the truncated top and bottom rows
#ax.set_ylim(len(corr_matrix), 0)

_Note: Depending on the version of `matplotlib` on your computer, some of you may see truncated top and bottom rows. This is a known bug; see https://github.com/matplotlib/matplotlib/issues/14751_
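
One way the heat map steps might fit together is sketched below. The data are random numbers under assumed column names, and the figure size and palette are arbitrary choices rather than the notebook's settings.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Random toy data; column names are assumptions for illustration.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.integers(0, 100, size=(50, 4)),
    columns=["Energy", "Danceability", "Valence", "Popularity"],
)

# pairwise Pearson correlations between the numeric columns
corr_matrix = df.corr()

# set the figure size
fig, ax = plt.subplots(figsize=(8, 6))

# define a diverging color palette centered on zero correlation
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# create the heat map on the prepared axes
sns.heatmap(corr_matrix, annot=True, cmap=cmap, vmin=-1, vmax=1, ax=ax)
```

If the data frame also holds text columns, select the numeric columns first before calling `corr()`.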

### Distributions

View summary statistics.

_`pandas.DataFrame.describe` documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html_

Transpose the summary statistics for better visibility.

Export the summary statistics.
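
A small sketch of these three steps on toy data (values are made up). Here `to_csv()` is called without a path so that it returns the CSV text; passing a file path instead would write the file to disk.

```python
import pandas as pd

# Toy numeric data; values are illustrative.
df = pd.DataFrame({
    "Energy": [55, 81, 80, 65],
    "Popularity": [79, 92, 85, 86],
})

summary = df.describe()   # count, mean, std, min, quartiles, max
summary_t = summary.T     # one row per variable, one column per statistic
print(summary_t)

# without a path, to_csv() returns the CSV content as a string
csv_text = summary_t.to_csv()
print(csv_text)
```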

### Histograms

Plot a histogram for one variable.

_`pandas.DataFrame.hist` documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.hist.html_

Plot histograms for multiple variables.

Change the bin size.
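
The histogram calls might look like the following sketch, which uses random toy data and assumed column names. In `hist()`, the `bins` argument sets the number of bins, which in turn determines their width.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random toy data; column names are assumptions.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Energy": rng.integers(30, 90, size=50),
    "Popularity": rng.integers(70, 96, size=50),
})

# one variable: Series.hist() returns a single Axes
ax_one = df["Energy"].hist()

# several variables: DataFrame.hist() returns a 2D array of Axes
axes_many = df[["Energy", "Popularity"]].hist()

# fewer, wider bins
axes_bins = df[["Energy", "Popularity"]].hist(bins=5)
```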

#### Format plots

Define 10 subplots and arrange them into 5 rows and 2 columns.

View the axes component.

Iterate through the axes.
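
A minimal sketch of the subplot grid. The titles are placeholders added only to show that each axes object can be addressed in the loop.

```python
import matplotlib.pyplot as plt

# 10 subplots arranged in 5 rows and 2 columns
fig, axes = plt.subplots(nrows=5, ncols=2)

# axes is a 2D NumPy array of Axes objects
print(axes.shape)

# flatten() turns the 5x2 grid into a flat sequence for easy looping
for i, ax in enumerate(axes.flatten()):
    ax.set_title(f"subplot {i}")
```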

Iterate through the column names in the data frame.

Isolate numeric columns.

A better way to isolate numeric columns is by using `df.dtypes`.

Check how many numeric columns we have.
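
The sketch below shows two possible ways to isolate numeric columns on a toy data frame (names and values are assumptions): checking each column inside a loop, and reading the types from `df.dtypes`.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy data frame mixing text and numeric columns (illustrative only).
df = pd.DataFrame({
    "Track.Name": ["Song A", "Song B"],
    "Genre": ["pop", "latin"],
    "Energy": [55, 81],
    "Popularity": [79, 92],
})

# iterate through the column names and test each column
numeric_cols = []
for col in df.columns:
    if is_numeric_dtype(df[col]):
        numeric_cols.append(col)

# the same result, read directly from df.dtypes
numeric_cols_alt = [
    col for col, dtype in df.dtypes.items() if is_numeric_dtype(dtype)
]

print(numeric_cols, len(numeric_cols))
```

`df.select_dtypes(include="number")` is another built-in option that returns the numeric columns as a data frame.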

Plot histograms for all numeric columns.

Increase the subplot figure sizes, improve the plot layout, and remove gridlines.

Use a color palette.

_Seaborn documentation on 'Choosing color palettes': https://seaborn.pydata.org/tutorial/color_palettes.html_

Apply this color palette to the histograms, and export the plot.
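
Putting the formatting steps together could look like this sketch. The data, the `husl` palette, the 2x2 grid, and the figure size are all assumptions chosen for a four-column toy example. The plot is saved to an in-memory buffer here; passing a file name to `savefig()` would write it to disk instead.

```python
import io

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Random toy data; column names are assumptions.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.integers(0, 100, size=(50, 4)),
    columns=["Energy", "Danceability", "Valence", "Popularity"],
)
numeric_cols = df.select_dtypes(include="number").columns

# one color per numeric column
palette = sns.color_palette("husl", len(numeric_cols))

# larger figure, one subplot per column
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 8))
for ax, col, color in zip(axes.flatten(), numeric_cols, palette):
    ax.hist(df[col], bins=10, color=color)
    ax.set_title(col)
    ax.grid(False)  # remove gridlines

fig.tight_layout()  # avoid overlapping titles and tick labels

# export the plot as PNG
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```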

### Violin/Box plots

The default violin plot.

Check the distribution of `Genre`.

Combine similar genres to reduce cardinality.

View data to ensure the mapping was done correctly.
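
One way to reduce cardinality is a keyword-based mapping, sketched below. The genre values and the helper `simplify_genre` are hypothetical; the grouping rules in the complete notebook may differ.

```python
import pandas as pd

# Toy genre values (illustrative only).
df = pd.DataFrame({
    "Genre": ["dance pop", "pop", "latin", "reggaeton",
              "canadian hip hop", "dance pop"],
})

# check the distribution of the original genres
print(df["Genre"].value_counts())


def simplify_genre(genre):
    """Hypothetical rule: keep the first matching keyword, else 'other'."""
    for keyword in ("pop", "hip hop", "latin"):
        if keyword in genre:
            return keyword
    return "other"


# store the grouped genre in a new column
df["Genre2"] = df["Genre"].apply(simplify_genre)

# view both columns side by side to verify the mapping
print(df[["Genre", "Genre2"]])
```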

Plot the new genre column created above.

Box plot.

Add axes labels.

Change plot layout to add gridlines.
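
A possible sketch of the violin and box plots, drawn side by side on toy data. The values, the axis label text, and the `whitegrid` style are assumptions made for the example.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy data; Genre2 stands for the grouped genre column.
df = pd.DataFrame({
    "Genre2": ["pop"] * 4 + ["latin"] * 4,
    "Popularity": [79, 84, 88, 92, 85, 87, 90, 94],
})

# the "whitegrid" style adds gridlines; set it before creating the axes
sns.set_style("whitegrid")

fig, (ax_violin, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# default violin plot of popularity by grouped genre
sns.violinplot(data=df, x="Genre2", y="Popularity", ax=ax_violin)

# box plot of the same data with custom axis labels
sns.boxplot(data=df, x="Genre2", y="Popularity", ax=ax_box)
ax_box.set_xlabel("Genre (grouped)")
ax_box.set_ylabel("Popularity")
```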

### Scatter plot

Plot loudness versus energy.

Label data points based on genre.
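
The scatter plot could be sketched as follows, with loudness on the y-axis and energy on the x-axis. The data and column names are assumptions; `hue` is the seaborn argument that colors points by category and adds a legend.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy data; names and values are illustrative.
df = pd.DataFrame({
    "Energy": [55, 81, 80, 65, 70, 48],
    "Loudness": [-6, -4, -4, -8, -5, -9],
    "Genre2": ["pop", "latin", "pop", "latin", "pop", "latin"],
})

fig, ax = plt.subplots()

# color each point by its grouped genre
sns.scatterplot(data=df, x="Energy", y="Loudness", hue="Genre2", ax=ax)
```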

### Data Aggregation

How many total artists are there in this dataset?

Frequency of occurrence for each artist.
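
A short sketch of both counts on toy data (the artist names and the column name are placeholders):

```python
import pandas as pd

# Toy data; artist names are placeholders.
df = pd.DataFrame({
    "Artist.Name": ["Artist A", "Artist B", "Artist A", "Artist C"],
})

# number of distinct artists
n_artists = df["Artist.Name"].nunique()

# how often each artist appears, most frequent first
artist_counts = df["Artist.Name"].value_counts()

print(n_artists)
print(artist_counts)
```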

Let's analyze (the average) popularity by genre.

You can also calculate the % distribution by dividing each count by the total number of records in the data frame.
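
As an illustration of the percentage calculation, assuming a toy `Genre2` column with made-up values:

```python
import pandas as pd

# Toy data; Genre2 stands for the grouped genre column.
df = pd.DataFrame({
    "Genre2": ["pop", "pop", "pop", "latin", "latin", "hip hop"],
})

# count of records per genre
genre_counts = df["Genre2"].value_counts()

# share of each genre as a percentage of all records
genre_pct = genre_counts / len(df) * 100
print(genre_pct)
```

`value_counts(normalize=True)` returns the same shares as fractions between 0 and 1.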

Use the `pandas` `groupby()` method to create a grouped object.

View the size of each category of `Genre2`.

Create a grouped series by selecting an attribute (column) from the grouped object.

Now you can run functions, such as `mean()`, `max()`, etc., on the series.

Save the results.
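
The grouping steps can be sketched like this on toy data (the values are made up, and the variable names are chosen only for the example):

```python
import pandas as pd

# Toy data; values are illustrative.
df = pd.DataFrame({
    "Genre2": ["pop", "pop", "pop", "latin", "latin", "hip hop"],
    "Popularity": [80, 85, 90, 88, 92, 84],
})

# groupby() returns a lazy grouped object; nothing is computed yet
grouped = df.groupby("Genre2")

# number of rows in each category
sizes = grouped.size()

# selecting a column gives a grouped series
popularity = grouped["Popularity"]

# aggregation functions run once per group
avg_popularity = popularity.mean()
max_popularity = popularity.max()

print(sizes)
print(avg_popularity)
```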

Create a two-axis plot to view the genre size and average popularity.

Sort values from the largest to the smallest genre.
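
A two-axis plot might be built as in the sketch below, with bars for genre size on the left axis and a line for average popularity on the right axis. The summary values, colors, and labels are assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy summary table; values are illustrative.
summary = pd.DataFrame(
    {"size": [1, 3, 2], "avg_popularity": [84.0, 85.0, 90.0]},
    index=["hip hop", "pop", "latin"],
)

# sort from the largest to the smallest genre
summary = summary.sort_values("size", ascending=False)

fig, ax_size = plt.subplots()
ax_size.bar(summary.index, summary["size"], color="lightgray")
ax_size.set_ylabel("Number of songs")

# twinx() creates a second y-axis that shares the same x-axis
ax_pop = ax_size.twinx()
ax_pop.plot(summary.index, summary["avg_popularity"],
            color="tab:red", marker="o")
ax_pop.set_ylabel("Average popularity")
```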

### Join (combine) data

Before we can join these two `pd.Series`, we must convert one of them into a `pd.DataFrame` object.

See: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html

Reset (flatten) the multi-level column names.

Let's save the results into a data frame.

Rename a column in this data frame.
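
A sketch of the join, assuming two series produced by a `groupby()` on toy data. The second half shows one common source of multi-level column names (aggregating with several functions) and one way to flatten them. All names and values are illustrative.

```python
import pandas as pd

# Toy data; values are illustrative.
df = pd.DataFrame({
    "Genre2": ["pop", "pop", "pop", "latin", "latin", "hip hop"],
    "Popularity": [80, 85, 90, 88, 92, 84],
})
grouped = df.groupby("Genre2")

sizes = grouped.size()                       # pd.Series
avg_popularity = grouped["Popularity"].mean()  # pd.Series named "Popularity"

# convert one series to a data frame, then join the other on the index
joined = sizes.to_frame(name="size").join(avg_popularity)

# rename a column in the joined data frame
joined = joined.rename(columns={"Popularity": "avg_popularity"})
print(joined)

# aggregating with several functions yields multi-level column names
stats = grouped.agg({"Popularity": ["count", "mean"]})

# flatten ("Popularity", "mean") into "Popularity_mean"
stats.columns = ["_".join(col) for col in stats.columns]
print(stats)
```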

Plot the results with genres sorted from largest to smallest.

### Plotting pairwise data relationships using `seaborn`