# Exploratory Data Analysis

Python Plotting for Exploratory Data Analysis https://pythonplot.com/

### Common visualizations for low-dimensional data:  
* DataFrame.plot library
* Histogram/KDE:   1D, Quantitative, distributions
* Bar:  2D, Cat x Quant, comparing points
* Pie: 2D, Cat x Quant, comparing points to whole
* Line/area:  2D, Quant(Continuous) x Quant, trend
* Scatter: 2D, Quant x Quant, correlation
* Annotating plots: Axes.text(), .add_line(), .add_path(), .annotate()
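
As a rough illustration of the annotation methods in the last bullet, here is a minimal sketch using `Axes.annotate()` and `Axes.text()`. The `toy` series and its values are invented for this example and are not part of the survey or Reddit data.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy values invented for illustration only (assumption, not course data).
toy = pd.Series([1, 3, 2, 8, 4], name="value")

fig, ax = plt.subplots()
toy.plot.line(marker="o", ax=ax)

# Locate the largest value so the arrow can point at it.
peak_x = toy.idxmax()
peak_y = toy.max()

# annotate() draws a label plus an arrow from the label to the point.
ax.annotate("peak",
            xy=(peak_x, peak_y),          # the point being annotated
            xytext=(peak_x - 2, peak_y),  # where the label text sits
            arrowprops=dict(arrowstyle="->"))

# text() places free-standing text at data coordinates (x=0, y=6).
ax.text(0, 6, "toy data")
```

`annotate()` is usually the better fit when the label should point at a specific data point; `text()` is enough for a plain caption inside the plot area.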

In [None]:
### Libraries:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

### Choose one of these graphical output styles:

# Static (most reliable). Remember magic commands
%matplotlib inline

### data import

In [None]:
s = pd.read_csv("data/Survey-clean.csv")

In [None]:
s.head()

## Common visualizations for low-dimensional data


### Histogram

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.hist.html

https://matplotlib.org/api/_as_gen/matplotlib.pyplot.hist.html

In [None]:
# Histogram
s.Height.plot.hist()

In [None]:
s.Height.plot.kde()
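
The shape of a histogram depends heavily on the number of bins. The sketch below, which assumes nothing about the survey file and uses randomly generated stand-in values, draws the same data with a coarse and a fine binning through the `bins` argument of `Series.plot.hist()`.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in values (assumption: NOT the survey's Height column).
rng = np.random.default_rng(0)
toy_height = pd.Series(rng.normal(loc=170, scale=10, size=500), name="Height")

# Two side-by-side axes so the effect of `bins` is easy to compare.
fig, (ax_coarse, ax_fine) = plt.subplots(1, 2, figsize=(10, 3))
toy_height.plot.hist(bins=5, ax=ax_coarse, title="bins=5")
toy_height.plot.hist(bins=30, ax=ax_fine, title="bins=30")
```

Too few bins can hide structure, while too many can make random noise look like structure, so it is worth trying a few values.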

<span class="mark">**TODO**</span>
plot a histogram on `Age` from the survey data

In [None]:
## Your code below




In [None]:
# Also try with
# s.hist()

### Bar Chart

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.bar.html

In [None]:
# Bar chart
s.plot.bar(x='Name', y='Height', figsize=(20,3), color='steelblue')
# double click on the chart to expand

#### Dataset: Data from a Social Media platform (Reddit) 
Let's practice EDA with some social media data.

CSV file containing Reddit user activity from 2007

* author: Reddit username
* num_comments : number of comments made
* num_subs: number of subreddits participated in
* karma : total Reddit score accumulated 
* controv : total number of controversial comments (comments with both high upvotes and high downvotes)
* gild : number of comments that received Reddit gold from other users 
* verbosity: average length of the comment

In [None]:
rdata = pd.read_csv("data/reddit_2007_author.csv")
rdata.head()

In [None]:
type(pd.to_numeric(rdata.controv))
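
The cell above only shows that `pd.to_numeric` returns a Series. As a small sketch of why the conversion can matter, the example below uses made-up string values (not taken from the Reddit file) and the `errors="coerce"` option, which turns unparseable entries into `NaN` instead of raising an error.

```python
import pandas as pd

# Made-up raw values for illustration; one entry is not a number.
raw = pd.Series(["3", "10", "n/a", "7"])

# errors="coerce" converts what it can and marks the rest as NaN.
clean = pd.to_numeric(raw, errors="coerce")

print(clean.dtype)          # the column becomes floating point because NaN is a float
print(clean.isna().sum())   # number of entries that could not be parsed
```

Checking the count of `NaN` values after coercion is a quick way to see how many rows had problematic entries before plotting.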

<span class="mark">QUESTION:</span> Visually what's the difference in the histogram that you see here for Reddit versus the one you saw for height? What does it tell you about the distribution of data in both cases?

#### Plot the distribution of comments made by users. We are not interested in users who make > 200 comments.

*Hint: filter based on this condition*


<span class="mark">TODO:</span>

In [None]:
# Your code below.



In [None]:
# hint code: users who make more than 5000 comments.

rdata[rdata.num_comments > 5000]

#### Right-skewed distribution. 
A lot of people make few comments and a few people make a lot of comments, which gives the histogram a long tail toward large values. Now let's look at the distribution of `num_comments` for the 100 most active users.
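
A long tail toward large values is called positive (right) skew. One quick numeric check, sketched here on synthetic log-normal values rather than the Reddit data, is to compare the mean with the median and to look at the sign of `Series.skew()`.

```python
import numpy as np
import pandas as pd

# Synthetic long-tailed values (assumption: a stand-in, not num_comments itself).
rng = np.random.default_rng(0)
toy_comments = pd.Series(rng.lognormal(mean=2.0, sigma=1.0, size=10_000))

# In a right-skewed sample the few very large values pull the mean above the median.
print("mean:  ", toy_comments.mean())
print("median:", toy_comments.median())

# A positive skewness coefficient indicates a tail toward large values.
print("skew:  ", toy_comments.skew())
```

The same three calls can be run on any numeric column to confirm what the histogram suggests visually.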

In [None]:
f1 = rdata.sort_values(by='num_comments', ascending=False).head(100)
f1.num_comments.plot.hist()

In [None]:
# Who made these top comments?

f1

### Bar chart

Plotting number of comments made by each user

In [None]:
f1.plot.bar(x='author', y='num_comments', figsize=(20,3), color='purple')

#### I want the axes to be properly labeled. The default labels are too hard to read.

In [None]:
f1.plot.bar(x='author', y='num_comments', figsize=(20,3), color='purple')
plt.xlabel('Author', fontsize=16) # Add x & y label, change font size
plt.ylabel('# of Comments', fontsize=16)
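
The `plt.xlabel` / `plt.ylabel` calls act on whichever axes is current. An alternative, sketched below on a tiny invented frame (the author names and counts are hypothetical), is to keep a reference to the `Axes` object and label it directly, which also makes it easy to rotate crowded tick labels.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical authors and counts, invented for illustration only.
toy = pd.DataFrame({"author": ["user_a", "user_b", "user_c"],
                    "num_comments": [120, 85, 40]})

# Create the axes explicitly and hand it to pandas.
fig, ax = plt.subplots(figsize=(6, 3))
toy.plot.bar(x="author", y="num_comments", color="purple", ax=ax)

# Label the axes through the Axes object instead of the pyplot state machine.
ax.set_xlabel("Author", fontsize=16)
ax.set_ylabel("# of Comments", fontsize=16)

# Rotate the x tick labels so long names do not overlap.
ax.tick_params(axis="x", labelrotation=45)
```

Working with the `Axes` object avoids surprises when several figures are open in the same notebook.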

<span class="mark">TODO</span>: The legend is too small. Can you change the size of the legend? *medium difficulty*

In [None]:
# Your code below




### Scatter plot

Visualizing the relationship between two quantitative (continuous) variables

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.scatter.html

In [None]:
# How does the karma earned by the user relate to user's contribution (# of comments)?

p = f1.plot.scatter(x='num_comments', y='karma', 
                    c = 'verbosity', #color
                    colormap=plt.cm.cool, 
                    #s=rdata.controv, #marker size
                    figsize=(15,10), 
                    alpha=0.5, #between 0 (transparent) and 1 (opaque).
                    sharex=False) # sharex=False keeps the x-axis label visible when pandas adds the colorbar
p.set_xlabel("Number of comments", fontsize=16)
p.set_ylabel("Karma", fontsize=16)

<span class="mark">**TODO**</span>
Try now with the survey data. **2 & 3 Difficult**

* 1. Scatter plot of age vs. height
* 2. If you are able to do 1, then also add year 'Born' on your plot to show additional information about when the person was born
* 3. If you are able to do 1 and 2, then show number of siblings as a marker size attribute

In [None]:
# Your code below