# Basic Data Visualizations
This module shows a few different techniques for retrieving and visualizing data using pandas and matplotlib. We will also work with the original cars dataset that I've posted to Kaggle. You will need to add that dataset to your notebook for some of the examples to work.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

To retrieve some data to work with, you'll use a library called `pandas_datareader`, which allows you to connect to multiple external data sources. The documentation is here: [https://pydata.github.io/pandas-datareader/#](https://pydata.github.io/pandas-datareader/#) (Note that some data sources require you to get an API key, which you can store as a variable in your Kaggle Secrets. None of the examples below requires this.)

In [None]:
import pandas_datareader.data as web

You will need the matplotlib library to make charts. It is common practice to import its `pyplot` module **as** `plt`, which means fewer characters to type every time you want to access its functions. You also import `datetime`, which provides useful functions for working with dates (like getting the current time).


In [None]:
import matplotlib.pyplot as plt
import datetime as dt

Next you'll retrieve stock prices as an easy-to-access source of data to practice with. Start by creating a list to store the stock tickers you want to download.

In [None]:
# Define the instruments to download. This example uses Apple only.
tickers = ['AAPL'] #, 'MSFT', 'IBM'] (you can add more tickers to the list)

# We would like all available data from 01/01/2017 until today.
start_date = '2017-01-01' # you can set this to whatever date you want
end_date = dt.datetime.now() # this puts the current time into a variable called end_date

# This next function creates a pandas dataframe containing the results of the DataReader query
# The 'yahoo' datasource provides the stock ticker info. (google and morningstar no longer work).
# The results are stored as a dataframe called df (nice and short!)
df = web.DataReader(tickers, data_source='yahoo', start=start_date, end=end_date)
# Inspect the first 5 rows
df.head()
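
The date range above combines a fixed start string with `dt.datetime.now()`. As a small sketch of an alternative, a rolling window can be computed with `timedelta`. The one-year window and the names `window_start` and `window_end` are assumptions chosen for illustration; they are not part of the example above.

```python
import datetime as dt

# Hypothetical rolling window: the 365 days leading up to right now
window_end = dt.datetime.now()
window_start = window_end - dt.timedelta(days=365)

# Format the dates the same way as the start_date string used above
print(window_start.strftime('%Y-%m-%d'), "to", window_end.strftime('%Y-%m-%d'))
```

These two values could then be used in place of `start_date` and `end_date`.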

Now for the first visualization! You use the dataframe's `plot` function, which draws a basic line graph with matplotlib. It can take many parameters, but it needs at least the data to plot on the y-axis, which you can request by name from the column headings of the new dataframe. You can plot the lowest price of each day from the 'Low' column, for example.

In [None]:
df.plot(y='Low') 
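
To give a feel for some of the other parameters `plot` accepts, here is a minimal sketch. It uses a small synthetic dataframe (`demo_prices`, with made-up values) instead of the downloaded data so that it runs without a network connection; the title, figure size, and axis label are arbitrary choices.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the downloaded prices (all values are made up)
dates = pd.date_range("2017-01-02", periods=60, freq="B")
rng = np.random.default_rng(0)
demo_prices = pd.DataFrame({"Low": 100 + rng.normal(0, 1, 60).cumsum()}, index=dates)

# title, figsize and legend are handled by pandas; linewidth is passed on to matplotlib
ax = demo_prices.plot(y="Low", title="Daily low (synthetic data)",
                      figsize=(10, 4), linewidth=1.5, legend=False)
ax.set_ylabel("Price (USD)")
```

`plot` returns a matplotlib Axes object, so you can keep adjusting the chart afterwards, as `set_ylabel` does here.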

Plotting multiple values is easy. Just specify which columns of the dataframe you want to plot.

In [None]:
df[["High", "Low"]].plot()
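
The double square brackets matter here. As a quick sketch on a toy dataframe (`toy`, with made-up values), single brackets select one column as a Series, while a list of names inside the brackets selects several columns as a new dataframe:

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({"High": [11.0, 12.0, 13.0],
                    "Low": [9.0, 10.0, 11.0],
                    "Volume": [100, 150, 120]})

one_column = toy["High"]            # a single column name gives a Series
two_columns = toy[["High", "Low"]]  # a list of column names gives a DataFrame

print(type(one_column).__name__)    # Series
print(type(two_columns).__name__)   # DataFrame
print(list(two_columns.columns))    # ['High', 'Low']
```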


You can also change the aesthetics of the plot to meet your needs. There are a lot of pre-set styles that you can choose from (easiest) or you can make your own by modifying specific parameters of the plot function (harder). To list the available styles, look at the `style.available` list.

In [None]:
plt.style.available

To use a specific style, call the `style.use` function and set the parameter to the name of the style you want. You need to call this function every time you want to change the style.

In [None]:
plt.style.use("fivethirtyeight") #need to reset this every time you want to change the template
df[["High", "Low"]].plot()

In [None]:
plt.style.use("ggplot")
df[["High", "Low"]].plot()
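
A style set with `style.use` stays active for every later plot until you change it again. If you only want a style for one chart, matplotlib also offers `plt.style.context`, which applies the style inside a `with` block only. The sketch below uses a toy dataframe (`toy_prices`, made-up values) and checks one style setting before, during, and after the block.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data, made up for illustration
toy_prices = pd.DataFrame({"High": [11, 12, 13, 12], "Low": [9, 10, 11, 10]})

facecolor_before = plt.rcParams["axes.facecolor"]
with plt.style.context("ggplot"):
    # Only plots created inside this block use the ggplot style
    ax = toy_prices.plot()
    facecolor_inside = plt.rcParams["axes.facecolor"]
facecolor_after = plt.rcParams["axes.facecolor"]

# The setting is restored once the block ends
print(facecolor_before, facecolor_inside, facecolor_after)
```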

## Bar Charts
You can also easily plot bar charts using matplotlib. Bar charts are good representations for ranking categorical and nominal data. This example uses Google stock data to count how many closing days were Poor, Good, or Stellar, depending on how they compare to the average closing value over the whole time period.

Suppose you want to answer the question: *"How many closing stock prices were low, medium, or high compared to the average closing price?"*

To do this, you need to know the average price over that time period and to create three categories for the closing values, compared to that average. You can use Python to create categories of data from the stock prices.

First get stock prices for Google (over the same time period as above).

Then calculate what the average (mean) price was over that time period.

In [None]:
google = web.DataReader('GOOG', data_source='yahoo', start=start_date, end=end_date)

google['Close'].mean()

In [None]:
google

You can use the mean price over that period to create three categories, depending upon whether the closing price on a day was below, near, or above it.

To do this, create a function that evaluates each price and sets its **rank performance**. You will pass this function the price on each row of the dataframe.

In [None]:
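def rank_performance(stock_price):
    # The cutoffs 900 and 1200 are hard-coded; adjust them to bracket
    # the mean closing price you calculated above.
    if stock_price <= 900:
        return "Poor"
    elif 900 < stock_price <= 1200:
        return "Good"
    elif stock_price > 1200:
        return "Stellar"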
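
The cutoffs in the function above are fixed numbers. As a hedged alternative, the sketch below derives them from the mean and lets `pd.cut` assign the categories. The closing prices are made up, and the band of 10% either side of the mean is an assumption chosen for illustration, not the rule used above.

```python
import numpy as np
import pandas as pd

# Made-up closing prices standing in for google['Close']
toy_close = pd.Series([800.0, 950.0, 1000.0, 1050.0, 1250.0, 950.0])

mean_price = toy_close.mean()      # 1000.0 for this toy data
lower = 0.9 * mean_price           # hypothetical cutoff: 10% below the mean
upper = 1.1 * mean_price           # hypothetical cutoff: 10% above the mean

# Each bin includes its right edge: (-inf, lower], (lower, upper], (upper, inf)
toy_ranks = pd.cut(toy_close, bins=[-np.inf, lower, upper, np.inf],
                   labels=["Poor", "Good", "Stellar"])
print(toy_ranks.tolist())  # ['Poor', 'Good', 'Good', 'Good', 'Stellar', 'Good']
```

Because the cutoffs follow the mean, the same code keeps working when the date range (and so the average price) changes.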

You then run this custom function against each of the values in the **Close** column.

In [None]:
google['Close'].apply(rank_performance)

Note that the values in the dataframe haven't changed: `apply` returns a new Series holding the ranking for each value in the **Close** column and leaves the original data untouched. To show the data hasn't changed, just view the dataframe:

In [None]:
google
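
If you would like to keep the rankings alongside the prices, you can assign the result of `apply` to a new column. This is a minimal sketch on a toy dataframe (`toy_stock`, made-up prices); the column name `Rank` is an assumption, not something the downloaded data contains.

```python
import pandas as pd

def rank_performance(stock_price):
    # Same cutoffs as the function defined earlier in this notebook
    if stock_price <= 900:
        return "Poor"
    elif stock_price <= 1200:
        return "Good"
    else:
        return "Stellar"

# Toy data, made up for illustration
toy_stock = pd.DataFrame({"Close": [850.0, 1000.0, 1300.0]})

# Assigning the returned Series creates a new column; 'Close' is left as it was
toy_stock["Rank"] = toy_stock["Close"].apply(rank_performance)
print(toy_stock)
```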

To finally create the bar chart of categories, you need to count how many times each ranking occurred. Conveniently, the `value_counts()` function does this. If you use dot (".") notation to append this function to the other ones, you don't have to create an intermediate variable to store the counts; you can pass the results straight on to the `.plot()` function. In this way, you are chaining the steps together with the "dot" notation. Note that the `kind` parameter sets the plot to a bar chart.

*get google 'Close' . -> apply the rank performance function . -> count the results . -> plot the results*

In [None]:
google['Close'].apply(rank_performance).value_counts().plot(kind="bar")
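
To see what the chain is doing, it can help to run the counting step on its own first. The sketch below uses a short, made-up Series of rankings (`toy_rankings`) rather than the Google data.

```python
import pandas as pd

# Made-up rankings standing in for the output of apply(rank_performance)
toy_rankings = pd.Series(["Good", "Poor", "Good", "Stellar", "Good", "Poor"])

# Step by step: store the counts in an intermediate variable and inspect them
rank_counts = toy_rankings.value_counts()
print(rank_counts.to_dict())  # {'Good': 3, 'Poor': 2, 'Stellar': 1}

# Chained: the same counts go straight into the bar chart
ax = toy_rankings.value_counts().plot(kind="bar")
```

`value_counts()` sorts the categories from most to least frequent, which is why the bars appear in that order.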

If you want a horizontal bar chart instead, just set the `kind` parameter to `"barh"`.

In [None]:
google["Close"].apply(rank_performance).value_counts().plot(kind="barh")

## Pie Charts
It is similarly easy to plot categories with a pie chart, to create a part-to-whole comparison.

First you load the results of the `DataReader` into a new variable to work with. Let's take Johnson & Johnson for example.

In [None]:
jnj = web.DataReader('JNJ', data_source='yahoo', start='2016-01-01', end=dt.datetime.now())
jnj.head()

How did the closing price each day compare to its average?
First let's find out the average:

In [None]:
jnj['Close'].mean()

We can write another custom function to determine whether each value is above or below the average price over this time period.

In [None]:
def above_or_below(stock_price):
    if stock_price >= 128.33:
        return "Above average"
    else:
        return "Below average"
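
The function above compares against a number typed in by hand, which goes stale whenever the date range changes. As a sketch of one alternative, you can compute the mean and label every value in one step with `np.where`. The prices below are made up, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up closing prices standing in for jnj['Close']
toy_jnj_close = pd.Series([120.0, 125.0, 130.0, 135.0, 140.0])

average_close = toy_jnj_close.mean()  # 130.0 for this toy data

# np.where picks the first label where the condition is True, otherwise the second
toy_labels = pd.Series(
    np.where(toy_jnj_close >= average_close, "Above average", "Below average"),
    index=toy_jnj_close.index,
)
print(toy_labels.value_counts().to_dict())  # {'Above average': 3, 'Below average': 2}
```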

You can then create a pie chart from the results of your custom function. Note the styling choices in this example. A full list of the styling parameters is in the matplotlib documentation: [https://matplotlib.org/3.1.0/api/_as_gen/matplotlib.pyplot.pie.html](https://matplotlib.org/3.1.0/api/_as_gen/matplotlib.pyplot.pie.html)

In [None]:
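categories = ['Above average', 'Below average']
colors = ['mediumseagreen', 'lightcoral']
# value_counts() sorts by frequency, so reindex to keep each label and color on the right slice
counts = jnj["Close"].apply(above_or_below).value_counts().reindex(categories, fill_value=0)
counts.plot(kind='pie', legend=False, labels=['above', 'below'], colors=colors)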
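
As one example of the other styling parameters, matplotlib's `autopct` prints the percentage on each slice and `startangle` rotates the starting point of the pie. This sketch uses made-up counts (`toy_counts`) so that it runs on its own; the format string and angle are arbitrary choices.

```python
import pandas as pd

# Made-up counts standing in for the value_counts() result
toy_counts = pd.Series({"Above average": 3, "Below average": 2})

ax = toy_counts.plot(kind="pie",
                     autopct="%1.1f%%",   # show each slice's share with one decimal place
                     startangle=90,       # start drawing from the top of the circle
                     colors=["mediumseagreen", "lightcoral"],
                     legend=False)
ax.set_ylabel("")  # remove the default y-axis label
```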

## Scatter Plots

Scatterplots require at least two columns of data, because you need to specify which columns to compare on the x and y axes. To try out these examples, you need my `original cars.csv` dataset, on Kaggle. Use the `read_csv()` function to create a dataframe from the file.

In [None]:
cars = pd.read_csv("../input/original-cars/original cars.csv")
cars # show the head and tail of this file

To show what a generic scatterplot might look like, you can create a bunch of random points and give them random colors and sizes.

In [None]:
N = 50
x = np.random.rand(N)
y = np.random.rand(N)
colors = np.random.rand(N)
size = (30 * np.random.rand(N))**2  # 0 to 15 point radii

plt.scatter(x, y, s=size, c=colors, alpha=0.5)
plt.show()

In [None]:
# Plot MPG against Horsepower; kind='scatter' needs both an x and a y column name
cars[['MPG','Horsepower']].plot(kind='scatter', x='MPG', y='Horsepower', alpha=0.5)

In [None]:
cars[['Acceleration','Horsepower']].plot(kind='scatter',x='Acceleration', y='Horsepower',  alpha=0.5, legend=True)

You can use the size parameter, `s`, to change how big the dots are.

In [None]:
size = cars['Displacement']  # a single column (Series) with one marker size per car
cars[['MPG','Horsepower']].plot(kind='scatter', x='MPG', y='Horsepower', alpha=0.5, legend=True, s=size, figsize=(12,8))
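
In matplotlib, `s` is the marker area in points squared, so raw column values can produce dots that are too large or too small. One option is to rescale the column to a range you choose. The sketch below uses a made-up dataframe (`toy_cars`) with the same column names as the cars dataset; the 20 to 300 range is an assumption picked for illustration.

```python
import pandas as pd

# Made-up values using the same column names as the cars dataset
toy_cars = pd.DataFrame({
    "MPG": [18.0, 25.0, 31.0, 14.0],
    "Horsepower": [130, 95, 70, 190],
    "Displacement": [307.0, 140.0, 90.0, 440.0],
})

# Rescale Displacement linearly so the smallest value maps to 20 and the largest to 300
d = toy_cars["Displacement"]
marker_sizes = 20 + 280 * (d - d.min()) / (d.max() - d.min())

ax = toy_cars.plot(kind="scatter", x="MPG", y="Horsepower",
                   s=marker_sizes, alpha=0.5, figsize=(8, 5))
```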

## Distributions
You can easily plot the distribution of values in a column using the pandas `hist()` method, which draws the chart with matplotlib. The example below plots a single column.

In [None]:
hist=cars.hist(column='MPG')

If you don't specify a column, you get a histogram for every numeric column! In the example below, the figure is made larger so that the histograms don't overlap each other.

In [None]:
hist=cars.hist(figsize=(12,8))
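
Behind each histogram is a simple count of how many values fall into each bin. As a sketch of that calculation, NumPy's `np.histogram` returns the counts and the bin edges without drawing anything. The MPG values below are made up, and four bins is an arbitrary choice.

```python
import numpy as np

# Made-up MPG values, for illustration only
toy_mpg = np.array([14, 15, 18, 18, 21, 22, 25, 26, 31, 36], dtype=float)

# Split the range 14 to 36 into 4 equal-width bins and count the values in each
bin_counts, bin_edges = np.histogram(toy_mpg, bins=4)

print(bin_edges)   # edges at 14, 19.5, 25, 30.5 and 36
print(bin_counts)  # 4, 2, 2 and 2 values per bin
```

Every value lands in exactly one bin, so the counts always add up to the number of values.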

You can customize the histogram by passing additional parameters to the `hist()` method and applying matplotlib styling:

In [None]:
hist = cars.hist(column='MPG', bins=10, grid=False, figsize=(12,8), color='#4290be', zorder=2, rwidth=0.9)

hist = hist[0] # cars.hist() returns a 2-D array of Axes; take the first row so we can loop over each subplot

for x in hist:

    # Switch off tickmarks
    x.tick_params(axis="both", which="both", bottom=False, top=False, labelbottom=True, left=False, right=False, labelleft=True)

    # Draw horizontal axis lines
    vals = x.get_yticks()
    for tick in vals:
        x.axhline(y=tick, linestyle='dashed', alpha=0.4, color='#eeeeee', zorder=1)

    # Set title (set to "" for no title!)
    x.set_title("Cars and MPG")

    # Set x-axis label
    x.set_xlabel("Miles per Gallon", labelpad=20, weight='bold', size=12)

    # Set y-axis label
    x.set_ylabel("Number of cars", labelpad=20, weight='bold', size=12)