### There are two main reasons for creating visuals using data:

Exploratory analysis is done when you are searching for insights. These visualizations don't need to be polished or aesthetically appealing: you are the consumer, and you only need to be able to answer your own question from the plots.


Explanatory analysis is done when you are presenting your results to others. These visualizations need to emphasize the message you want to convey. They should be accurate, insightful, and visually appealing.

##### The five steps of the data analysis process:

* Extract - Obtain the data from a spreadsheet, SQL, the web, etc.

* Clean - Here we could use exploratory visuals.

* Explore - Here we use exploratory visuals.

* Analyze - Here we might use either exploratory or explanatory visuals.

* Share - Here is where explanatory visuals live.

### Python Data Visualization Libraries
In this course, you will make use of the following libraries for creating data visualizations:

* matplotlib: a versatile library for visualizations, but it can take some code effort to put together common visualizations.
* seaborn: built on top of matplotlib, adds a number of functions to make common statistical visualizations easier to generate.
* pandas: while this library includes some convenient methods for visualizing data that hook into matplotlib, we'll mainly be using it for its main purpose as a general tool for working with data.

Together, these libraries let us balance productivity and flexibility when visualizing data, for both exploratory and explanatory analyses.


### Design of Visualizations

#### Visuals can be bad if they:

* Don't convey the message.
* Are misleading.

### The Four Levels of Measurement
In order to choose an appropriate plot type or method of analysis for your data, you need to understand the types of data you have. One common method divides the data into four levels of measurement:

##### Qualitative or categorical types (non-numeric types)
1. `Nominal data`: pure labels without inherent order
2. `Ordinal data`: labels with an intrinsic order or ranking (comparison operations can be made between values)
##### Quantitative or numeric types
3. `Interval data`: numeric values where absolute differences are meaningful (addition and subtraction operations can be made)
4. `Ratio data`: numeric values where relative differences are meaningful (multiplication and division operations can be made)

All quantitative variables also come in one of two varieties: discrete and continuous.

* Discrete quantitative variables can only take on a specific set of values at some maximum level of precision.
* Continuous quantitative variables can (hypothetically) take on values to any level of precision.

When exploring your data, the most important thing to consider first is whether your data is qualitative or quantitative. In later lessons, you will see how this distinction impacts your choice of plots.
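
As a small sketch of the nominal vs. ordinal distinction, pandas can store labels either as an unordered or an ordered categorical. The survey labels and their ranking below are made up for illustration and are not part of the course data.

```python
import pandas as pd

# Hypothetical survey responses; the labels and their ranking are assumptions.
levels = ["low", "medium", "high"]
responses = pd.Series(["medium", "low", "high", "medium", "low"])

# Nominal: plain labels, no order implied.
nominal = responses.astype("category")

# Ordinal: the same labels with an explicit ranking.
ordinal = responses.astype(pd.CategoricalDtype(categories=levels, ordered=True))

print(nominal.cat.ordered)
print(ordinal.cat.ordered)

# Comparison operations are only meaningful for the ordered version.
above_low = (ordinal > "low").tolist()
print(above_low)
print(ordinal.min(), ordinal.max())
```

Declaring the order explicitly also means that plots and sorts can follow the ranking rather than alphabetical order.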

Color can both help and hurt a data visualization. Here are three tips for using color effectively:

* Before adding color to a visualization, start with black and white.

* When using color, use less intense colors - not all the colors of the rainbow, which is the default in many software applications.

* Color for communication. Use color to highlight your message and separate groups of interest. Don't add color just to have color in your visualization.

In [None]:
## Libraries for loading and manipulating the data:
import pandas as pd
import numpy as np

## visualization tools in Python:
import matplotlib.pyplot as plt 
import seaborn as sb

%matplotlib inline


In [None]:
pokemon = pd.read_csv('pokemon.csv')
print(pokemon.shape)
pokemon.head(3)

In [None]:
sb.countplot(data = pokemon, x = 'generation_id');

## The trailing ';' suppresses the printed object information.


### We can see that most Pokémon were introduced in generations 1, 3, and 5, and the fewest in generations 6 and 7.

In [None]:
sb.color_palette()   # returns the current default palette as a list of RGB tuples, so I can pick a color from it.

In [None]:
## I want just one color, so I take one tuple from the list and assign it to a variable.
base_color = sb.color_palette()[1]

In [None]:
## now the revised plot is much cleaner.
sb.countplot(data = pokemon, x = 'generation_id', color = base_color);

In [None]:
## Now I want to order the bars (generation_id is ordinal).
# I can leverage pandas: value_counts() sorts the categories by frequency (descending),
# and .index gives the category labels in that order.

print(pokemon['generation_id'].value_counts().index)


gen_order = pokemon['generation_id'].value_counts().index

In [None]:
# The bars are now sorted by frequency.
sb.countplot(data = pokemon, x = 'generation_id', color = base_color, order = gen_order);
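
To make the ordering step concrete, here is a minimal sketch on a small made-up series (a stand-in for the `generation_id` column, not the real data). It shows the frequency order that `value_counts()` produces next to the natural ascending order of the category values; either list could be passed as `order` to `sb.countplot`.

```python
import pandas as pd

# Hypothetical stand-in for the 'generation_id' column.
generation = pd.Series([1, 1, 1, 2, 3, 3, 2, 1, 3, 3, 3])

# value_counts() sorts by count, largest first; the index holds the labels.
counts = generation.value_counts()
freq_order = counts.index.tolist()

# Natural order of the category values, smallest first.
natural_order = sorted(generation.unique().tolist())

print(freq_order)
print(natural_order)
```

For an ordinal variable, the natural order is often easier to read, while frequency order makes the most common categories stand out.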


In [None]:
## Let us look at the 'type_1' column, which has more categories:

sb.countplot(data = pokemon, x = 'type_1', color = base_color);
## Notice that the tick labels overlap. This is something I need to avoid.

In [None]:
## To get rid of the overlapping labels, I rotate the tick labels.
sb.countplot(data = pokemon, x = 'type_1', color = base_color);
plt.xticks(rotation = 90);
# plt.xticks is a matplotlib function; rotation = 90 turns the labels 90 degrees counterclockwise so they no longer overlap.

### We can see that grass and fire are the first types to appear in the plot, and flying is the least common type.

In [None]:
## Alternatively, we can draw a horizontal bar chart
# just by passing the column as 'y' instead of 'x'.
sb.countplot(data = pokemon, y = 'type_1', color = base_color);

In [None]:
## Just ordering the above types by frequency.
print(pokemon['type_1'].value_counts().index)

ordering = pokemon['type_1'].value_counts().index

sb.countplot(data = pokemon, y = 'type_1', color = base_color, order = ordering);

In [None]:
pokemon.columns

## Histograms for Numerical data

In [None]:

plt.hist(data = pokemon, x = 'speed');

## What do we notice here?
# By default, matplotlib uses 10 bins.
# The bin boundaries are also not aligned with the tick marks, which makes interpretation trickier.

In [None]:
plt.hist(data = pokemon, x = 'speed', bins = 20)
# Here I can see the bin edges and counts returned by hist.
# Because the bin edges are non-integers and the data values are integers, some bins will
# cover more integer values than others.
## Specifying the bin boundaries explicitly will make the bins more useful.
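
The point about bin edges can be checked without plotting. The sketch below uses `np.histogram`, which takes the same kind of `bins` argument as `plt.hist`, on a small made-up array of integer values (a stand-in for the `speed` column). The bin width of 5 is an assumption chosen for illustration.

```python
import numpy as np

# Hypothetical integer-valued data standing in for the 'speed' column.
speed = np.array([5, 10, 15, 20, 20, 25, 30, 45, 45, 50,
                  60, 65, 70, 90, 100, 120, 160])

# Automatic binning: 20 equal-width bins between the minimum and maximum.
_, auto_edges = np.histogram(speed, bins=20)
print(auto_edges[:4])   # these edges fall on non-integer values

# Explicit binning: width 5, starting at 0 and reaching the maximum.
step = 5
explicit_edges = np.arange(0, speed.max() + step, step)
counts, edges = np.histogram(speed, bins=explicit_edges)
print(edges[:4])
print(counts.sum())
```

With explicit edges, every bin covers the same number of integer values, so bar heights are directly comparable.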

In [None]:
## seaborn can also create a histogram with distplot
# (note: distplot is deprecated since seaborn 0.11 in favor of histplot and displot)

sb.distplot(pokemon['speed'])

# The line here is a kernel density estimate of the data distribution. The total area under the curve is
# equal to one.
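
When a density curve is drawn, the histogram bars are scaled the same way, so their total area is also one. The sketch below checks this normalization with `np.histogram(..., density=True)` on randomly generated values. The data and its parameters are assumptions, and this demonstrates the histogram scaling only, not the kernel density estimate itself.

```python
import numpy as np

# Hypothetical continuous data; the mean and spread are arbitrary.
rng = np.random.default_rng(0)
values = rng.normal(loc=65, scale=25, size=500)

# density=True rescales bar heights so that height * width sums to one.
density, edges = np.histogram(values, bins=20, density=True)
widths = np.diff(edges)
total_area = (density * widths).sum()

print(round(total_area, 6))
```

This is why the y-axis of such a plot shows densities rather than counts.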


In [None]:
## I can get rid of the density curve with kde = False, which leaves a plain histogram of counts.
sb.distplot(pokemon['speed'], kde = False);

In [None]:
## Explicit bin edges of width 0.5, from 0 up to an edge that covers the maximum height.
bins = np.arange(0, pokemon['height'].max()+ 0.5, 0.5)
plt.hist(data = pokemon, x = 'height', bins = bins);
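
The `+ 0.5` in the bin definition matters because `np.arange` excludes its stop value. A minimal sketch with made-up heights (the values are assumptions, not taken from the dataset) shows that the largest value drops out of the histogram when the extra step is left off.

```python
import numpy as np

# Hypothetical heights; the maximum sits exactly on a multiple of the step.
heights = np.array([0.3, 1.1, 2.0, 14.5])
step = 0.5

# Without the extra step, the last edge stops short of the maximum.
edges_short = np.arange(0, heights.max(), step)
# With the extra step, the last edge reaches the maximum.
edges_full = np.arange(0, heights.max() + step, step)

counts_short, _ = np.histogram(heights, bins=edges_short)
counts_full, _ = np.histogram(heights, bins=edges_full)

print(edges_short[-1], counts_short.sum())
print(edges_full[-1], counts_full.sum())
```

Values outside the outermost edges are silently ignored, so it is worth confirming that the edges span the full data range.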