# (Golden Rules for) Exploratory Data Analysis (EDA)

What's below is some hard-earned wisdom. Please accept it into your heart and go forth so that data never harms you as it has harmed so many before you.

```{figure} https://media.giphy.com/media/Wn74RUT0vjnoU98Hnt/source.gif
---
height: 300px
name: baby-yodSSSSS
---
You, soon to be wise
```


````{dropdown} 1. **GOLDEN RULES for EDA**

You got new data - fun! But before you go speeding off into analysis, first you must learn some basic facts about the data. 

You should **always, always, always** open your datasets up to physically (digitally?) look at them[^member] and then also generate summary statistics. Here are _some_ basic outputs you should examine carefully. 

1. Print the first and last five rows: `df.head()` and `df.tail()`
2. What is the shape (# rows and variables) of the data? `df.shape`
3. Print the column names: `df.columns`
4. How much memory does it take, and what are the variable names/types? `df.info()`
5. Summary stats on variables:
    - `df.describe()` - count, mean, sd, etc
    - `df['var'].value_counts()[:10]`  - the most common values of a variable
    - `df['var'].nunique()` - the number of unique values of a variable
    
```{tip}
Automate your initial EDA by putting something like the below in your codebook.

You can refine this as you go; you'll see more and more tips on how to improve your EDA.
```
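
Here is a minimal sketch of what such a codebook helper might look like. The function name `initial_eda`, its `max_values` argument, and the tiny `toy` dataset are made up for illustration; it simply bundles the checks listed above and assumes your data is already in a pandas DataFrame.

```python
import pandas as pd


def initial_eda(df, max_values=10):
    """Print a first-pass overview of a DataFrame (hypothetical helper)."""
    print(df.head(), '\n---')
    print(df.tail(), '\n---')
    print("Shape (rows, columns):", df.shape, '\n---')
    print(df.columns.tolist(), '\n---')
    df.info()  # prints its own report (memory, dtypes, non-null counts)
    print('---')
    print(df.describe(), '\n---')
    # value_counts and nunique are most useful for non-numeric variables
    for col in df.select_dtypes(exclude='number').columns:
        print(df[col].value_counts().head(max_values), '\n---')
        print(col, "has", df[col].nunique(), "unique values", '\n---')


# made-up data, just to show the call
toy = pd.DataFrame({
    "firm": ["A", "A", "B", "C"],
    "sales": [10.0, 12.5, 7.0, 3.2],
})
initial_eda(toy)
```

Calling one function on every new dataset makes it harder to skip a check, but it does not replace reading the output.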

```{warning}
Two things:
1. Don't just run these and move on! Actually **look** at the output and check for possible issues. (Some possible issues are listed in the next dropdown.)
1. This isn't a comprehensive list of things I'd do to check datasets, merely a reasonable start!

```

````

[^member]: [Remember this?](../01/07_debugging.html#seriously-print-your-data-and-objects-often)

````{dropdown} 2. Data cleaning

```{admonition} A thought not quite profound enough to be wisdom
:class: tip
Data cleaning, exploration, and analysis exist in a never-ending feedback loop. Analysis projects are rarely linear. You'll be doing some analysis and realize there is a data problem, or that you need new data, and you're back at step one above.
```
 
Here are some things you might look for while cleaning your data:

| Look for | Comment |
| :--- | :--- |
| Do some variables have large outliers? | You might need to winsorize, or drop, those observations. |
| Do some variables have many missing values? | Maybe there was a problem with the source or how you loaded it. |
| Do some variables have impossible values? | For example, sales can't be negative! Yet some datasets use "-99" to indicate missing data. Or are there bad dates (June 31st)? |
| Are there duplicate observations? | E.g. two 2001-GM observations because they changed their fiscal year in the middle of calendar year 2001? |
| Are there missing observations or a gap in the time series? | E.g. no observation in 2005 for Enron because the executive team was too "distracted" to file its 10-K? |
| Do variables contradict each other? | E.g. if my birthday is 1/5/1999, then I must be 21 as of this writing, so a dataset listing any other age for me contradicts itself. |
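
Several of the checks in the table can be sketched in a few lines of pandas. The data below is made up, with problems planted on purpose, and the column names (`firm`, `year`, `sales`) and the 1%/99% winsorizing cutoffs are assumptions for illustration, not a rule.

```python
import numpy as np
import pandas as pd

# made-up data with problems planted on purpose
df = pd.DataFrame({
    "firm":  ["GM", "GM", "F", "F", "TSLA", "TSLA"],
    "year":  [2001, 2001, 2001, 2003, 2001, 2002],
    "sales": [150.0, 150.0, -99.0, 120.0, np.nan, 9000.0],
})

# missing values: fraction of missing observations per variable
print(df.isna().mean())

# impossible values: sales can't be negative, so look at those rows,
# then treat the -99 code as missing
print(df.query("sales < 0"))
df["sales"] = df["sales"].replace(-99, np.nan)

# duplicate observations: more than one row per firm-year
dupes = df[df.duplicated(subset=["firm", "year"], keep=False)]
print(dupes)

# gaps in the time series: more than one year between a firm's rows
df = df.sort_values(["firm", "year"])
year_step = df.groupby("firm")["year"].diff()
gaps = df[year_step > 1]
print(gaps)

# outliers: winsorize by clipping at the 1st and 99th percentiles
lo, hi = df["sales"].quantile([0.01, 0.99])
df["sales_w"] = df["sales"].clip(lower=lo, upper=hi)
print(df)
```

Each check only flags candidates. Whether to drop, fix, or keep a flagged row is a judgment call that depends on what you know about the data.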

````

````{dropdown} 3. Institutional knowledge is crucial

```{tip}
Lehigh offers students free subscriptions to WSJ, NYT, and FT!
```

- For example, in my research on firm investment policies and financing decisions (e.g. leverage), I drop any firms from the finance or utility industries. I know to do this because those industries are subject to several factors that fundamentally change their leverage in ways unrelated to their investment policies. Thus, including them would contaminate any relationships I find.
- When we work with 10-K text data, we need to know what data tables are available and how to remove boilerplate. 
- You only get institutional knowledge through reading supporting materials, documentation, related info (research papers, WSJ, etc.), and... the documents themselves. For example, I've read in excess of 6,000 proxy statements for a single research project. (If that sounds exciting: Go to grad school and become a prof!)


````

```{dropdown} 4. Explore the data with your question in mind

- Compute statistics by subgroups (by age, or industry), or two-way (by gender and age)
- Do a big correlation matrix to get a sense of possible relationships between variables in the data on a first pass. (We will do this later.)
- This step reinforces institutional knowledge and your understanding of the data

Remember, **data exploration** and **data cleaning** are interrelated and you'll go back and forth from one to the other.

```
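
As a rough sketch of the first two bullets in that dropdown, here is one way to get subgroup statistics, a two-way table, and a correlation matrix. The `people` data and its column names are invented for illustration.

```python
import pandas as pd

# made-up data for illustration only
people = pd.DataFrame({
    "gender":    ["F", "F", "M", "M", "F", "M"],
    "age_group": ["young", "old", "young", "old", "young", "old"],
    "income":    [40, 65, 38, 70, 44, 62],
    "hours":     [35, 40, 38, 45, 36, 42],
})

# statistics by one subgroup
by_age = people.groupby("age_group")["income"].agg(["count", "mean", "std"])
print(by_age, '\n---')

# two-way: mean income by gender (rows) and age group (columns)
two_way = people.pivot_table(index="gender", columns="age_group",
                             values="income", aggfunc="mean")
print(two_way, '\n---')

# correlation matrix of the numeric variables
corr = people.select_dtypes(include="number").corr()
print(corr)
```

With real data you would swap in the grouping variables that matter for your question, which is where institutional knowledge comes back in.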

## Toy EDA 

I do all of the suggested steps here. The output isn't pretty, and I'd clean it up if I were writing a report. But this gets the ball rolling.

```{note}
This data only has one categorical variable (species). You should do `value_counts` and `nunique` for all categorical variables.
```


```python
# import a famous dataset; seaborn can load it for us (it downloads the file on first use)
import seaborn as sns
iris = sns.load_dataset('iris')

print(iris.head(),  '\n---')
print(iris.tail(),  '\n---')
print(iris.columns, '\n---')
print("The shape is: ", iris.shape, '\n---')
# memory usage, name, dtype, and # of non-null obs (--> # of missing obs) per variable
# note: info() prints its report itself and returns None, so "Info: None" shows up after the report
print("Info:", iris.info(), '\n---')
print(iris.describe(), '\n---') # summary stats, and you can customize the list!
print(iris['species'].value_counts()[:10], '\n---')
print(iris['species'].nunique(), '\n---')
```

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa 
---
     sepal_length  sepal_width  petal_length  petal_width    species
145           6.7          3.0           5.2          2.3  virginica
146           6.3          2.5           5.0          1.9  virginica
147           6.5          3.0           5.2          2.0  virginica
148           6.2          3.4           5.4          2.3  virginica
149           5.9          3.0           5.1          1.8  virginica 
---
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object') 
---
The shape is:  (150, 5) 
---
<class 'pandas.core.frame.DataFrame'>
RangeIndex