In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from numpy import nan as NaN  # the NaN alias was removed in NumPy 2.0
from glob import glob
import re

In [None]:
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 300)
pd.set_option('display.expand_frame_repr', True)

# Review of Pandas DataFrames

## Data ingestion & inspection

### pandas DataFrames

* Example: DataFrame of Apple Stock data

In [None]:
AAPL = pd.read_csv(r'DataCamp-master/11-pandas-foundations/_datasets/AAPL.csv',
                   index_col='Date', parse_dates=True)

In [None]:
AAPL.head()

* The rows are labeled by a special data structure called an Index.
    * Indexes in Pandas are tailored lists of labels that permit fast look-up and some powerful relational operations.
* The index labels in the AAPL DataFrame are dates in reverse chronological order.
* Labeled rows & columns improve the clarity and intuition of many data analysis tasks.
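
As a minimal sketch of label-based look-up, the cell below builds a small date-indexed DataFrame and retrieves one value by its label. The name `prices` and all numbers are invented for illustration; they are not taken from AAPL.csv.

```python
import pandas as pd

# Hypothetical prices indexed by date (values invented for illustration)
dates = pd.to_datetime(['2017-01-05', '2017-01-04', '2017-01-03'])
prices = pd.DataFrame({'Close': [116.61, 116.02, 116.15],
                       'Volume': [22193600, 21118100, 28781900]},
                      index=dates)
prices.index.name = 'Date'

# Look up a value by its row and column labels rather than by position
jan4_close = prices.loc[pd.Timestamp('2017-01-04'), 'Close']
print(jan4_close)
```

The same value could be reached positionally with `prices.iloc[1, 0]`, but the label-based form keeps working if the rows are reordered.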

In [None]:
type(AAPL)

In [None]:
AAPL.shape

In [None]:
AAPL.columns

In [None]:
type(AAPL.columns)

In [None]:
AAPL.index

In [None]:
type(AAPL.index)

* DataFrames can be sliced like NumPy arrays or Python lists using colons to specify the start, end and stride of a slice.

In [None]:
# Start of the DataFrame to the 5th row, inclusive of all columns
AAPL.iloc[:5,:]

In [None]:
# Start at the 5th last row to the end of the DataFrame using a negative index
AAPL.iloc[-5:,:]

In [None]:
AAPL.head()

In [None]:
AAPL.tail()

In [None]:
AAPL.info()

In [None]:
AAPL.Close.plot(kind='line')

# Add first subplot
plt.subplot(2, 1, 1)
AAPL.Close.plot(kind='line')

# Add title and specify axis labels
plt.title('Close')
plt.ylabel('Value - $')
plt.xlabel('Year')

# Add second subplot
plt.subplot(2, 1, 2)
AAPL.Volume.plot(kind='line')

# Add title and specify axis labels
plt.title('Volume')
plt.ylabel('Number of Shares')
plt.xlabel('Year')

# Display the plots
plt.tight_layout()
plt.show()

### Broadcasting

* Assigning scalar value to column slice broadcasts value to each row

In [None]:
AAPL.iloc[::3, -1] = np.nan  # every 3rd row of Volume is now NaN

In [None]:
AAPL.head(7)

In [None]:
AAPL.info()

* Note that Volume now has fewer non-null values
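
The same broadcasting idea on a toy frame, as a sketch: `toy` and its values are made up so the effect can be counted by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical seven-row frame; float columns so NaN fits without a dtype change
toy = pd.DataFrame({'price': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
                    'volume': [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0]})

# Rows 0, 3 and 6 of the last column all receive the same scalar
toy.iloc[::3, -1] = np.nan

n_missing = toy['volume'].isna().sum()
n_present = toy['volume'].count()
print(n_missing, n_present)
```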

### Series

In [None]:
low = AAPL.Low

In [None]:
type(low)

In [None]:
low.head()

In [None]:
lows = low.values

In [None]:
type(lows)

In [None]:
lows[0:5]

* A Pandas Series, then, is a 1D labeled NumPy array, and a DataFrame is a 2D labeled array whose columns are Series
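
A short sketch of that relationship, using a made-up one-column frame (`frame` and its labels are assumptions, not the AAPL data):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'Low': [115.5, 114.2, 116.0]}, index=['a', 'b', 'c'])

low = frame['Low']            # one column of a DataFrame is a Series
low_values = low.to_numpy()   # the same data as a plain NumPy array

# The Series is addressed by label, the array by position
print(type(low).__name__, type(low_values).__name__)
print(low['b'], low_values[1])
```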

### Inspecting your data

You can use the DataFrame methods ```.head()``` and ```.tail()``` to view the first few and last few rows of a DataFrame. In this exercise, we have imported pandas as ```pd``` and loaded population data from 1960 to 2014 as a DataFrame ```df```. This dataset was obtained from the World Bank.

Your job is to use ```df.head()``` and ```df.tail()``` to verify that the first and last rows match a file on disk. In later exercises, you will see how to extract values from DataFrames with indexing, but for now, manually copy/paste or type values into assignment statements where needed. Select the correct answer for the first and last values in the ```'Year'``` and ```'Total Population'``` columns.

### Instructions

Possible Answers
* First: 1980, 26183676.0; Last: 2000, 35.
* First: 1960, 92495902.0; Last: 2014, 15245855.0.
* First: 40.472, 2001; Last: 44.5, 1880.
* First: CSS, 104170.0; Last: USA, 95.203.

In [None]:
wb_df = pd.read_csv(r'DataCamp-master/11-pandas-foundations/_datasets/world_ind_pop_data.csv')

In [None]:
wb_df.head()

In [None]:
wb_df.tail()

### DataFrame data types

Pandas is aware of the data types in the columns of your DataFrame. It is also aware of null and ```NaN``` ('Not-a-Number') types which often indicate missing data. In this exercise, we have imported pandas as ```pd``` and read in the world population data which contains some ```NaN``` values, a value often used as a place-holder for missing or otherwise invalid data entries. Your job is to use ```df.info()``` to determine information about the total count of ```non-null``` entries and infer the total count of ```'null'``` entries, which likely indicates missing data. Select the best description of this data set from the following:

### Instructions

Possible Answers
* The data is all of type float64 and none of it is missing.
* The data is of mixed type, and 9914 of it is missing.
* The data is of mixed type, and 3460 float64s are missing.
* The data is all of type float64, and 3460 float64s are missing.

```python
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13374 entries, 0 to 13373
Data columns (total 5 columns):
CountryName                      13374 non-null object
CountryCode                      13374 non-null object
Year                             13374 non-null int64
Total Population                 9914 non-null float64
Urban population (% of total)    13374 non-null float64
dtypes: float64(2), int64(1), object(2)
memory usage: 522.5+ KB
```
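
The null count can be inferred from the output above by subtraction (13374 total entries minus 9914 non-null gives 3460), or computed directly. The sketch below does the latter on a small invented frame; `sample` is a stand-in, not the World Bank file.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing population values
sample = pd.DataFrame({'Year': [1960, 1961, 1962, 1963],
                       'Total Population': [9.2e7, np.nan, 9.6e7, np.nan]})

missing_per_column = sample.isna().sum()   # null entries per column
non_null_per_column = sample.count()       # the counts that .info() reports

print(missing_per_column)
```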

In [None]:
wb_df.info()

### NumPy and pandas working together
Pandas depends upon and interoperates with NumPy, the Python library for fast numeric array computations. For example, you can use the DataFrame attribute ```.values``` to represent a DataFrame ```df``` as a NumPy array. You can also pass pandas data structures to NumPy methods. In this exercise, we have imported pandas as ```pd``` and loaded world population data every 10 years since 1960 into the DataFrame ```df```. This dataset was derived from the one used in the previous exercise.

Your job is to extract the values and store them in an array using the attribute ```.values```. You'll then use those values as input into the NumPy ```np.log10()``` method to compute the base 10 logarithm of the population values. Finally, you will pass the entire pandas DataFrame into the same NumPy ```np.log10()``` method and compare the results.

### Instructions

* Import ```numpy``` using the standard alias ```np```.
* Assign the numerical values in the DataFrame ```df``` to an array ```np_vals``` using the attribute ```values```.
* Pass ```np_vals``` into the NumPy method ```log10()``` and store the results in ```np_vals_log10```.
* Pass the entire ```df``` DataFrame into the NumPy method ```log10()``` and store the results in ```df_log10```.
* Inspect the output of the ```print()``` code to see the ```type()``` of the variables that you created.

In [None]:
pop_df = pd.read_csv(r'DataCamp-master/11-pandas-foundations/_datasets/world_population.csv')

In [None]:
pop_df.info()

In [None]:
# Create array of DataFrame values: np_vals
np_vals = pop_df.values

In [None]:
np_vals

In [None]:
# Create new array of base 10 logarithm values: np_vals_log10
np_vals_log10 = np.log10(np_vals)

In [None]:
np_vals_log10

In [None]:
# Create array of new DataFrame by passing df to np.log10(): df_log10
pop_df_log10 = np.log10(pop_df)

In [None]:
pop_df_log10

In [None]:
# Print original and new data containers
for name, obj in [('np_vals', np_vals), ('np_vals_log10', np_vals_log10),
                  ('pop_df', pop_df), ('pop_df_log10', pop_df_log10)]:
    print(name, 'has type', type(obj))

***As a data scientist, you'll frequently interact with NumPy arrays, pandas Series, and pandas DataFrames, and you'll leverage a variety of NumPy and pandas methods to perform your desired computations. Understanding how NumPy and pandas work together will prove to be very useful.***

### Building DataFrames from Scratch

* DataFrames read in from CSV
```python
pd.read_csv()
```

* DataFrames from dict (1)

In [None]:
data = {'weekday': ['Sun', 'Sun', 'Mon', 'Mon'],
        'city': ['Austin', 'Dallas', 'Austin', 'Dallas'],
        'visitors': [139, 237, 326, 456],
        'signups': [7, 12, 3, 5]}

In [None]:
users = pd.DataFrame(data)

In [None]:
users

* DataFrames from dict (2)
    * lists

In [None]:
cities = ['Austin', 'Dallas', 'Austin', 'Dallas']
signups = [7, 12, 3, 5]
weekdays = ['Sun', 'Sun', 'Mon', 'Mon']
visitors = [139, 237, 326, 456]

list_labels = ['city', 'signups', 'visitors', 'weekday']
list_cols = [cities, signups, visitors, weekdays]  # list of lists

zipped = list(zip(list_labels, list_cols))  # tuples
zipped

* DataFrames from dict (3)

In [None]:
data2 = dict(zipped)

In [None]:
users2 = pd.DataFrame(data2)

In [None]:
users2

### Broadcasting

* Saves time by generating long lists, arrays or columns without loops

In [None]:
users['fees'] = 0  # Broadcasts value to entire column

In [None]:
users

### Broadcasting with a dict

In [None]:
heights = [59.0, 65.2, 62.9, 65.4, 63.7, 65.7, 64.1]

In [None]:
data = {'height': heights, 'sex': 'M'}  # M is broadcast to the entire column

In [None]:
results = pd.DataFrame(data)

In [None]:
results

### Index and columns

* We can assign a list of strings to the attributes ```columns``` and ```index```, as long as each list has exactly one label per column or row.

In [None]:
results.columns = ['height (in)', 'sex']

In [None]:
results.index = ['A', 'B', 'C', 'D', 'E', 'F', 'G']

In [None]:
results
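
What "suitable length" means in practice, as a sketch on a three-row frame (`small` is a made-up name): a list with the wrong number of labels is rejected with a `ValueError`.

```python
import pandas as pd

small = pd.DataFrame({'height': [59.0, 65.2, 62.9], 'sex': 'M'})
small.index = ['A', 'B', 'C']          # three labels for three rows: accepted

try:
    small.index = ['A', 'B']           # two labels for three rows: rejected
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(mismatch_raised)
```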

### Zip lists to build a DataFrame

In this exercise, you're going to make a pandas DataFrame of the top three countries to win gold medals since 1896 by first building a dictionary. ```list_keys``` contains the column names ```'Country'``` and ```'Total'```. ```list_values``` contains the full names of each country and the number of gold medals awarded. The values have been taken from [Wikipedia](https://en.wikipedia.org/wiki/All-time_Olympic_Games_medal_table).

Your job is to use these lists to construct a list of tuples, use the list of tuples to construct a dictionary, and then use that dictionary to construct a DataFrame. In doing so, you'll make use of the ```list()```, ```zip()```, ```dict()``` and ```pd.DataFrame()``` functions. Pandas has already been imported as pd.

Note: The [zip()](https://docs.python.org/3/library/functions.html#zip) function in Python 3 returns a special zip object, which is a lazy iterator. To convert this ```zip``` object into a list, you'll need to use ```list()```. You can learn more about the ```zip()``` function as well as generators in [Python Data Science Toolbox (Part 2)](https://www.datacamp.com/courses/python-data-science-toolbox-part-2).
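
A small sketch of why the ```list()``` call matters: a zip object can only be consumed once, so a second pass over it yields nothing. The variable names are chosen for illustration only.

```python
keys = ['Country', 'Total']
values = [['United States', 'Soviet Union', 'United Kingdom'], [1118, 473, 273]]

pairs = zip(keys, values)
first_pass = list(pairs)    # consumes the iterator
second_pass = list(pairs)   # nothing is left to yield

print(first_pass)
print(second_pass)
```

Storing the result of ```list(zip(...))``` once, as the exercise asks, avoids this surprise when the pairs are both printed and passed to ```dict()```.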

### Instructions

* Zip the 2 lists ```list_keys``` and ```list_values``` together into one list of (key, value) tuples. Be sure to convert the ```zip``` object into a list, and store the result in ```zipped```.
* Inspect the contents of ```zipped``` using ```print()```. This has been done for you.
* Construct a dictionary using ```zipped```. Store the result as ```data```.
* Construct a DataFrame using the dictionary. Store the result as ```df```.

In [None]:
list_keys = ['Country', 'Total']
list_values = [['United States', 'Soviet Union', 'United Kingdom'], [1118, 473, 273]]

In [None]:
zipped = list(zip(list_keys, list_values))  # tuples
zipped

In [None]:
data = dict(zipped)

In [None]:
data

In [None]:
data_df = pd.DataFrame.from_dict(data)

In [None]:
data_df

### Labeling your data

You can use the DataFrame attribute ```df.columns``` to view and assign new string labels to columns in a pandas DataFrame.

In this exercise, we have imported pandas as ```pd``` and defined a DataFrame ```df``` containing top Billboard hits from the 1980s (from [Wikipedia](https://en.wikipedia.org/wiki/List_of_Billboard_Hot_100_number-one_singles_of_the_1980s#1980)). Each row has the year, artist, song name and the number of weeks at the top. However, this DataFrame has the column labels ```a, b, c, d```. Your job is to use the ```df.columns``` attribute to re-assign descriptive column labels.

### Instructions

* Create a list of new column labels with ```'year'```, ```'artist'```, ```'song'```, ```'chart weeks'```, and assign it to ```list_labels```.
* Assign your list of labels to ```df.columns```.

In [None]:
billboard_values = np.array([['1980', 'Blondie', 'Call Me', '6'],
                             ['1981', 'Christopher Cross', 'Arthurs Theme', '3'],
                             ['1982', 'Joan Jett', 'I Love Rock and Roll', '7']]).transpose()
billboard_keys = ['a', 'b', 'c', 'd']

billboard_zipped = list(zip(billboard_keys, billboard_values))
billboard_zipped

In [None]:
billboard_dict = dict(billboard_zipped)

In [None]:
billboard_dict

In [None]:
billboard = pd.DataFrame.from_dict(billboard_dict)

In [None]:
billboard

In [None]:
# Build a list of labels: list_labels
list_labels = ['year', 'artist', 'song', 'chart weeks']

In [None]:
# Assign the list of labels to the columns attribute: df.columns
billboard.columns = list_labels

In [None]:
billboard

### Building DataFrames with broadcasting

You can implicitly use 'broadcasting', a feature of NumPy, when creating pandas DataFrames. In this exercise, you're going to create a DataFrame of cities in Pennsylvania that contains the city name in one column and the state name in the second. We have imported the names of 15 cities as the list ```cities```.

Your job is to construct a DataFrame from the list of cities and the string ```'PA'```.

### Instructions

* Make a string object with the value ```'PA'``` and assign it to ```state```.
* Construct a dictionary with 2 key:value pairs: ```'state':state``` and ```'city':cities```.
* Construct a pandas DataFrame from the dictionary you created and assign it to ```df```.

In [None]:
cities = ['Manheim', 'Preston park', 'Biglerville',
          'Indiana', 'Curwensville', 'Crown',
          'Harveys lake', 'Mineral springs', 'Cassville',
          'Hannastown', 'Saltsburg', 'Tunkhannock',
          'Pittsburgh', 'Lemasters', 'Great bend']

In [None]:
# Make a string with the value 'PA': state
state = 'PA'

In [None]:
# Construct a dictionary: data
data = {'state': state, 'city': cities}

In [None]:
# Construct a DataFrame from dictionary data: df
pa_df = pd.DataFrame.from_dict(data)

In [None]:
# Print the DataFrame
print(pa_df)

### Importing & Exporting Data

* Dataset: Sunspot observations collected from SILSO

```python
Format: Comma Separated values (adapted for import in spreadsheets)
The separator is the semicolon ';'.

Contents:
Column 1-3: Gregorian calendar date
- Year
- Month
- Day
Column 4: Date in fraction of year.
Column 5: Daily total sunspot number. A value of -1 indicates that no number is available for that day (missing value).
Column 6: Daily standard deviation of the input sunspot numbers from individual stations.
Column 7: Number of observations used to compute the daily value.
Column 8: Definitive/provisional indicator. '1' indicates that the value is definitive. '0' indicates that the value is still provisional.
```

In [None]:
filepath = r'data/silso_sunspot_data_1818-2019.csv'

In [None]:
sunspots = pd.read_csv(filepath, sep=';')
sunspots.info()

In [None]:
sunspots.iloc[10:20, :]

#### Problems

* CSV file has no column headers
    * Columns 0-2: Gregorian date (year, month, day)
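    * Column 3: Date as fraction of year
    * Column 4: Daily total sunspot number
    * Column 5: Daily standard deviation of the input sunspot numbers
    * Column 6: Number of observations used to compute the daily value
    * Column 7: Definitive / provisional indicator (1 or 0)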
* Missing values in column 4: indicated by -1
* Date representation inconvenient

In [None]:
sunspots = pd.read_csv(filepath, sep=';', header=None)
sunspots.iloc[10:20, :]

#### Using names keyword

In [None]:
col_names = ['year', 'month', 'day', 'dec_date',
             'tot_sunspots', 'daily_std', 'observations', 'definite']

In [None]:
sunspots = pd.read_csv(filepath, sep=';', header=None, names=col_names)
sunspots.iloc[10:20, :]

#### Using na_values keyword (1)

In [None]:
sunspots = pd.read_csv(filepath, sep=';',
                       header=None,
                       names=col_names,
                       na_values='-1')
sunspots.iloc[10:20, :]

#### Using na_values keyword (2)

In [None]:
sunspots = pd.read_csv(filepath, sep=';',
                       header=None,
                       names=col_names,
                       na_values='  -1')
sunspots.iloc[10:20, :]

In [None]:
sunspots.info()

#### Using na_values keyword (3)

In [None]:
sunspots = pd.read_csv(filepath, sep=';',
                       header=None,
                       names=col_names,
                       na_values={'tot_sunspots':['  -1'],
                                  'daily_std':['-1']})
sunspots.iloc[10:20, :]

#### Using parse_dates keyword

In [None]:
sunspots = pd.read_csv(filepath, sep=';',
                       header=None,
                       names=col_names,
                       na_values={'tot_sunspots':['  -1'],
                                  'daily_std':['-1']},
                       parse_dates=[[0, 1, 2]])
sunspots.iloc[10:20, :]
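
As far as I can tell, recent pandas releases warn that the nested-list form of `parse_dates` is deprecated. One alternative, sketched below under that assumption, is to read the three date columns as plain integers and combine them afterwards with `pd.to_datetime`. The three sample rows are invented in the SILSO layout, and `sample` is a hypothetical name.

```python
import io
import pandas as pd

# Hypothetical rows in the SILSO layout (semicolon-separated, no header)
raw = io.StringIO(
    "1818;01;01;1818.001;-1;-1.0;0;1\n"
    "1818;01;08;1818.021;65;10.2;1;1\n"
    "1818;01;13;1818.034;37;7.7;1;1\n"
)
col_names = ['year', 'month', 'day', 'dec_date',
             'tot_sunspots', 'daily_std', 'observations', 'definite']

sample = pd.read_csv(raw, sep=';', header=None, names=col_names,
                     na_values={'tot_sunspots': ['-1'], 'daily_std': ['-1.0']})

# Assemble one datetime per row from the year, month and day columns
sample.index = pd.to_datetime(sample[['year', 'month', 'day']])
sample.index.name = 'date'
print(sample[['tot_sunspots', 'daily_std']])
```

With this route the original `year`, `month` and `day` columns stay in the frame until they are dropped explicitly.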

#### Inspecting DataFrame

In [None]:
sunspots.info()

#### Using dates as index

In [None]:
sunspots.index = sunspots['year_month_day']
sunspots.index.name = 'date'
sunspots.iloc[10:20, :]

In [None]:
sunspots.info()

#### Trimming redundant columns

In [None]:
cols = ['tot_sunspots', 'daily_std', 'observations', 'definite']
sunspots = sunspots[cols]
sunspots.iloc[10:20, :]

#### Writing files

```python
out_csv = 'sunspots.csv'
sunspots.to_csv(out_csv)
out_tsv = 'sunspots.tsv'
sunspots.to_csv(out_tsv, sep='\t')
out_xlsx = 'sunspots.xlsx'
sunspots.to_excel(out_xlsx)
```

### Reading a flat file

In previous exercises, we have preloaded the data for you using the pandas function ```read_csv()```. Now, it's your turn! Your job is to read the World Bank population data you saw earlier into a DataFrame using ```read_csv()```. The file is available in the variable ```data_file```.

The next step is to reread the same file, but simultaneously rename the columns using the ```names``` keyword input parameter, set equal to a list of new column labels. You will also need to set ```header=0``` to rename the column labels.

Finish up by inspecting the result with ```df.head()``` and ```df.info()``` in the IPython Shell (changing ```df``` to the name of your DataFrame variable).

```pandas``` has already been imported and is available in the workspace as ```pd```.

### Instructions

* Use ***pd.read_csv()*** with the string ***data_file*** to read the CSV file into a DataFrame and assign it to ***df1***.
* Create a list of new column labels - ***'year'***, ***'population'*** - and assign it to the variable ***new_labels***.
* Reread the same file, again using ***pd.read_csv()***, but this time, add the keyword arguments ***header=0*** and ***names=new_labels***. Assign the resulting DataFrame to ***df2***.
* Print both the ***df1*** and ***df2*** DataFrames to see the change in column names. This has already been done for you.

In [None]:
data_file = 'DataCamp-master/11-pandas-foundations/_datasets/world_population.csv'

In [None]:
# Read in the file: df1
df1 = pd.read_csv(data_file)

In [None]:
# Create a list of the new column labels: new_labels
new_labels = ['year', 'population']

In [None]:
# Read in the file, specifying the header and names parameters: df2
df2 = pd.read_csv(data_file, header=0, names=new_labels)

In [None]:
# Print both the DataFrames
df1.head()

In [None]:
df2.head()
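
To see why ```header=0``` is needed alongside ```names```, the sketch below reads the same two-row CSV text twice. The text is invented and only mimics a file with a header row.

```python
import io
import pandas as pd

csv_text = "Year,Total Population\n1960,100\n1970,200\n"
new_labels = ['year', 'population']

# header=0: the first line is treated as the header and replaced by new_labels
renamed = pd.read_csv(io.StringIO(csv_text), header=0, names=new_labels)

# names without header=0: the original header line is read as a data row
not_renamed = pd.read_csv(io.StringIO(csv_text), names=new_labels)

print(renamed.shape, not_renamed.shape)
```

In the second frame the old header text ends up in the first row, which also forces both columns to be read as strings.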

### Delimiters, headers, and extensions

Not all data files are clean and tidy. Pandas provides methods for reading those not-so-perfect data files that you encounter far too often.

In this exercise, you have monthly stock data for four companies downloaded from [Yahoo Finance](http://finance.yahoo.com/). The data is stored as one row for each company and each column is the end-of-month closing price. The file name is given to you in the variable ```file_messy```.

In addition, this file has three aspects that may cause trouble for lesser tools: multiple header lines, comment records (rows) interleaved throughout the data rows, and space delimiters instead of commas.

Your job is to use pandas to read the data from this problematic ```file_messy``` using non-default input options with ```read_csv()``` so as to tidy up the mess at read time. Then, write the cleaned up data to a CSV file with the variable ```file_clean``` that has been prepared for you, as you might do in a real data workflow.

You can learn about the option input parameters needed by using ```help()``` on the pandas function ```pd.read_csv()```.

### Instructions

* Use ***pd.read_csv()*** without using any keyword arguments to read ***file_messy*** into a pandas DataFrame ***df1***.
* Use ***.head()*** to print the first 5 rows of ***df1*** and see how messy it is. Do this in the IPython Shell first so you can see how modifying ***read_csv()*** can clean up this mess.
* Using the keyword arguments ***delimiter=' '***, ***header=3*** and ***comment='#'***, use ***pd.read_csv()*** again to read ***file_messy*** into a new DataFrame ***df2***.
* Print the output of ***df2.head()*** to verify the file was read correctly.
* Use the DataFrame method ***.to_csv()*** to save the DataFrame ***df2*** to the variable ***file_clean***. Be sure to specify ***index=False***.
* Use the DataFrame method ***.to_excel()*** to save the DataFrame ***df2*** to the file ***'file_clean.xlsx'***. Again, remember to specify ***index=False***.

In [None]:
# Read the raw file as-is: df1
file_messy = 'DataCamp-master/11-pandas-foundations/_datasets/messy_stock_data.tsv'
df1 = pd.read_csv(file_messy)

In [None]:
# Print the output of df1.head()
df1.head()

In [None]:
# Read in the file with the correct parameters: df2
df2 = pd.read_csv(file_messy, delimiter=' ', header=3, comment='#')

In [None]:
# Print the output of df2.head()
df2.head()
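
The three keyword arguments can be tried without the course file. The sketch below uses a small invented text with two junk lines before the header, a comment row between data rows, and space delimiters; the company rows and prices are hypothetical.

```python
import io
import pandas as pd

# Hypothetical messy file: two junk lines, then the header on line index 2
messy_text = (
    "Monthly closing prices\n"
    "Source: unknown\n"
    "name Jan Feb\n"
    "# comment between rows\n"
    "IBM 156.08 160.01\n"
    "MSFT 45.51 43.08\n"
)

clean = pd.read_csv(io.StringIO(messy_text),
                    delimiter=' ',   # fields are separated by single spaces
                    header=2,        # the column labels sit on the third line
                    comment='#')     # drop comment rows

print(clean)
```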

#### save files

```python
# Save the cleaned up DataFrame to a CSV file without the index
df2.to_csv(file_clean, index=False)
# Save the cleaned up DataFrame to an excel file without the index
df2.to_excel('file_clean.xlsx', index=False)
```

### Plotting with Pandas

In [None]:
cols = ['date', 'open', 'high', 'low', 'close', 'adj_close', 'volume']
aapl = pd.read_csv(r'DataCamp-master/11-pandas-foundations/_datasets/AAPL.csv',
                   names=cols,
                   index_col='date',
                   parse_dates=True,
                   header=0,
                   na_values='null')

In [None]:
aapl.head()

In [None]:
aapl.info()

In [None]:
aapl.tail()

#### Plotting arrays (matplotlib)

In [None]:
close_arr = aapl['close'].values

In [None]:
type(close_arr)

In [None]:
plt.plot(close_arr)

#### Plotting Series (matplotlib)

In [None]:
close_series = aapl['close']

In [None]:
type(close_series)

In [None]:
plt.plot(close_series)

#### Plotting Series (pandas)

In [None]:
close_series.plot()

#### Plotting DataFrames (pandas)

In [None]:
aapl.plot()

#### Plotting DataFrames (matplotlib)

In [None]:
plt.plot(aapl)

#### Fixing Scales

In [None]:
aapl.plot()
plt.yscale('log')
plt.show()

#### Customizing plots

In [None]:
aapl['open'].plot(color='b', style='.-', legend=True)
aapl['close'].plot(color='r', style='.', legend=True)
plt.axis(('2000', '2001', 0, 10))
plt.show()

#### Saving Plots

In [None]:
aapl.loc['2001':'2004', ['open', 'close', 'high', 'low']].plot()

plt.savefig('aapl.png')
plt.savefig('aapl.jpg')
plt.savefig('aapl.pdf')

plt.show()

### Plotting series using pandas

Data visualization is often a very effective first step in gaining a rough understanding of a data set to be analyzed. Pandas provides data visualization by both depending upon and interoperating with the matplotlib library. You will now explore some of the basic plotting mechanics with pandas as well as related matplotlib options. We have pre-loaded a pandas DataFrame ```df``` which contains the data you need. Your job is to use the DataFrame method ```df.plot()``` to visualize the data, and then explore the optional matplotlib input parameters that this ```.plot()``` method accepts.

The pandas ```.plot()``` method makes calls to matplotlib to construct the plots. This means that you can use the skills you've learned in previous visualization courses to customize the plot. In this exercise, you'll add a custom title and axis labels to the figure.

Before plotting, inspect the DataFrame in the IPython Shell using ```df.head()```. Also, use ```type(df)``` and note that it is a single column DataFrame.

### Instructions

* Create the plot with the DataFrame method ***df.plot()***. Specify a ***color*** of ***'red'***.
    * Note: ***c*** and ***color*** are interchangeable as parameters here, but we ask you to be explicit and specify ***color***.
* Use ***plt.title()*** to give the plot a title of ***'Temperature in Austin'***.
* Use ***plt.xlabel()*** to give the plot an x-axis label of ***'Hours since midnight August 1, 2010'***.
* Use ***plt.ylabel()*** to give the plot a y-axis label of ***'Temperature (degrees F)'***.
* Finally, display the plot using ***plt.show()***

In [None]:
data_file = 'DataCamp-master/11-pandas-foundations/_datasets/weather_data_austin_2010.csv'
df = pd.read_csv(data_file, usecols=['Temperature'])

In [None]:
df.info()

In [None]:
df.head()

In [None]:
# Create a plot with color='red'
df.plot(color='r')

# Add a title
plt.title('Temperature in Austin')

# Specify the x-axis label
plt.xlabel('Hours since midnight August 1, 2010')

# Specify the y-axis label
plt.ylabel('Temperature (degrees F)')

# Display the plot
plt.show()

### Plotting DataFrames

Comparing data from several columns can be very illuminating. Pandas makes doing so easy with multi-column DataFrames. By default, calling ```df.plot()``` will cause pandas to over-plot all column data, with each column as a single line. In this exercise, we have pre-loaded three columns of data from a weather data set - temperature, dew point, and pressure - but the problem is that pressure has different units of measure. The pressure data, measured in Atmospheres, has a different vertical scaling than that of the other two data columns, which are both measured in degrees Fahrenheit.

Your job is to plot all columns as a multi-line plot to see the nature of the vertical scaling problem. Then, use a list of column names passed into the DataFrame ```df[column_list]``` to limit plotting to just one column, and then just 2 columns of data. When you are finished, you will have created 4 plots. You can cycle through them by clicking on the 'Previous Plot' and 'Next Plot' buttons.

As in the previous exercise, inspect the DataFrame ```df``` in the IPython Shell using the ```.head()``` and ```.info()``` methods.

### Instructions

* Plot all columns together on one figure by calling ***df.plot()***, and noting the vertical scaling problem.
* Plot all columns as subplots. To do so, you need to specify ***subplots=True*** inside ***.plot()***.
* Plot a single column of dew point data. To do this, define a column list containing a single column name ***'Dew Point (deg F)'***, and call ***df[column_list1].plot()***.
* Plot two columns of data, ***'Temperature (deg F)'*** and ***'Dew Point (deg F)'***. To do this, define a list containing those column names and pass it into ***df[]***, as ***df[column_list2].plot()***.

In [None]:
data_file = 'DataCamp-master/11-pandas-foundations/_datasets/weather_data_austin_2010.csv'
df = pd.read_csv(data_file, parse_dates=[3], index_col='Date')
df.head()

In [None]:
# Plot all columns (default)
df.plot()
plt.show()

In [None]:
# Plot all columns as subplots
df.plot(subplots=True)
plt.show()

In [None]:
# Plot just the Dew Point data
column_list1 = ['DewPoint']
df[column_list1].plot()
plt.show()

In [None]:
# Plot the Dew Point and Temperature data, but not the Pressure data
column_list2 = ['Temperature','DewPoint']
df[column_list2].plot()
plt.show()

## Exploratory Data Analysis

### Visual exploratory data analysis

#### The Iris Dataset

* Famous dataset in pattern recognition
* 150 observations, 4 features each
    * Sepal length
    * Sepal width
    * Petal length
    * Petal width
* 3 species:
    * setosa
    * versicolor
    * virginica

In [None]:
data_file = 'DataCamp-master/11-pandas-foundations/_datasets/iris.csv'
iris = pd.read_csv(data_file)

In [None]:
iris.shape

In [None]:
iris.head()

#### Line plot

In [None]:
iris.plot(x='sepal length (cm)', y='sepal width (cm)')

#### Scatter Plot

In [None]:
iris.plot(x='sepal length (cm)', y='sepal width (cm)',
          kind='scatter')
plt.xlabel('sepal length (cm)')
plt.ylabel('sepal width (cm)')

#### Box Plot

In [None]:
iris.plot(y='sepal length (cm)',
          kind='box')
plt.ylabel('sepal length (cm)')

#### Histogram

In [None]:
iris.plot(y='sepal length (cm)',
          kind='hist')
plt.xlabel('sepal length (cm)')

#### Histogram Options

* **bins** (integer): number of intervals or bins
* **range** (tuple): extrema of bins (minimum, maximum)
* **density** (boolean): whether to normalize the histogram so that its total area is one (formerly **normed**)
* **cumulative** (boolean): compute the cumulative distribution function (CDF)
* ... more matplotlib customizations
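
What these options compute can be sketched numerically with NumPy's `np.histogram`, which accepts the same `bins`, `range` and `density` arguments. It has no `cumulative` option, so the CDF is built here with a running sum. This is a stand-in for the plotting call, and the sepal lengths are invented rather than taken from the iris file.

```python
import numpy as np

# Hypothetical sepal lengths in cm (invented values, not the iris data)
lengths = np.array([4.6, 5.1, 5.2, 5.4, 5.8, 6.1, 6.3, 6.7, 7.1, 7.7])

# bins and range fix the bin edges; density=True rescales so the area is 1
density, edges = np.histogram(lengths, bins=4, range=(4, 8), density=True)
widths = np.diff(edges)

# Running sum of bin areas: the empirical CDF at each right-hand bin edge
cdf = np.cumsum(density * widths)

print(edges)
print(density)
print(cdf)
```

The last entry of `cdf` is one, which is why a cumulative, density-normalized histogram tops out at 1.0 on the y-axis.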

#### Customizing Histogram

In [None]:
iris.plot(y='sepal length (cm)',
          kind='hist',
          bins=30,
          range=(4, 8),
          density=True)
plt.xlabel('sepal length (cm)')

#### Cumulative Distribution

In [None]:
iris.plot(y='sepal length (cm)',
          kind='hist',
          bins=30,
          range=(4, 8),
          density=True,
          cumulative=True)
plt.xlabel('sepal length (cm)')
plt.title('Cumulative Distribution Function (CDF)')

#### Word of Warning

* Three different DataFrame plot idioms
    * iris.plot(kind='hist')
    * iris.plot.hist()
    * iris.hist()
* Syntax / Results differ!
* Pandas API still evolving: check the documentation

### pandas line plots

In the previous chapter, you saw that the ```.plot()``` method will place the Index values on the x-axis by default. In this exercise, you'll practice making line plots with specific columns on the x and y axes.

You will work with a dataset consisting of monthly stock prices in 2015 for AAPL, GOOG, and IBM. The stock prices were obtained from [Yahoo Finance](http://finance.yahoo.com/). Your job is to plot the 'Month' column on the x-axis and the AAPL and IBM prices on the y-axis using a list of column names.

All necessary modules have been imported for you, and the DataFrame is available in the workspace as df. Explore it using methods such as ```.head()```, ```.info()```, and ```.describe()``` to see the column names.

#### Instructions

* Create a list of y-axis column names called ***y_columns*** consisting of ***'AAPL'*** and ***'IBM'***.
* Generate a line plot with ***x='Month'*** and ***y=y_columns*** as inputs.
* Give the plot a title of ***'Monthly stock prices'***.
* Specify the y-axis label.
* Display the plot.

In [None]:
values = [['Jan', 117.160004, 534.5224450000002, 153.309998],
          ['Feb', 128.46000700000002, 558.402511, 161.940002],
          ['Mar', 124.43, 548.002468, 160.5],
          ['Apr', 125.150002, 537.340027, 171.28999299999995],
          ['May', 130.279999, 532.1099849999998, 169.649994],
          ['Jun', 125.43, 520.51001, 162.660004],
          ['Jul', 121.300003, 625.6099849999998, 161.990005],
          ['Aug', 112.760002, 618.25, 147.889999],
          ['Sep', 110.300003, 608.419983, 144.970001],
          ['Oct', 119.5, 710.8099980000002, 140.080002],
          ['Nov', 118.300003, 742.599976, 139.419998],
          ['Dec', 105.260002, 758.880005, 137.619995]]

values = np.array(values).transpose()

In [None]:
cols = ['Month', 'AAPL', 'GOOG', 'IBM']

In [None]:
data_zipped = list(zip(cols, values))

In [None]:
data_dict = dict(data_zipped)

In [None]:
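# Build the DataFrame, then cast only the price columns ('Month' holds strings)
df = pd.DataFrame.from_dict(data_dict).astype({'AAPL': float, 'GOOG': float, 'IBM': float})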

In [None]:
df

In [None]:
df.info()

In [None]:
# Create a list of y-axis column names: y_columns
y_columns = ['AAPL', 'IBM']

# Generate a line plot
df.plot(x='Month', y=y_columns)

# Add the title
plt.title('Monthly stock prices')

# Add the y-axis label
plt.ylabel('Price ($US)')

# Display the plot
plt.show()

### pandas scatter plots

Pandas scatter plots are generated using the ```kind='scatter'``` keyword argument. Scatter plots require that the x and y columns be chosen by specifying the ```x``` and ```y``` parameters inside ```.plot()```. Scatter plots also take an ```s``` keyword argument that sets the size of each marker; matplotlib interprets it as the marker area in points squared rather than as a radius in pixels.

In this exercise, you're going to plot fuel efficiency (miles-per-gallon) versus horse-power for 392 automobiles manufactured from 1970 to 1982. The data come from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Auto+MPG).

The size of each circle is provided as a NumPy array called ```sizes```. This array contains the normalized ```'weight'``` of each automobile in the dataset.
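
The source does not say how ```sizes``` was normalized. As a rough sketch, assuming a simple rescaling by the maximum weight (an assumption, not the exercise's actual recipe), an array of marker sizes could be derived from a weight column as follows. The toy DataFrame ```cars``` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real exercise uses the auto-mpg dataset
cars = pd.DataFrame({'hp': [130, 165, 95, 46],
                     'mpg': [18.0, 15.0, 24.0, 26.0],
                     'weight': [3504, 3693, 2372, 1835]})

# Assumed normalization: scale so the heaviest car gets a marker size of 100
toy_sizes = (100 * cars['weight'] / cars['weight'].max()).to_numpy()

# toy_sizes has one entry per row, so it could be passed as
# cars.plot(kind='scatter', x='hp', y='mpg', s=toy_sizes)
print(toy_sizes.round(1))
```

The only requirement for ```s``` is that the array has one value per plotted row; the scaling factor is a matter of taste.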

All necessary modules have been imported and the DataFrame is available in the workspace as df.

#### Instructions

* Generate a scatter plot with ***'hp'*** on the x-axis and ***'mpg'*** on the y-axis. Specify ***s=sizes***.
* Add a title to the plot.
* Specify the x-axis and y-axis labels.

In [None]:
data_file = 'DataCamp-master/11-pandas-foundations/_datasets/auto-mpg.csv'
df = pd.read_csv(data_file)
df.head()

In [None]:
df.info()

In [None]:
sizes = np.array([ 51.12044694,  56.78387977,  49.15557238,  49.06977358,
        49.52823321,  78.4595872 ,  78.93021696,  77.41479205,
        81.52541106,  61.71459825,  52.85646225,  54.23007578,
        58.89427963,  39.65137852,  23.42587473,  33.41639502,
        32.03903011,  27.8650165 ,  18.88972581,  14.0196956 ,
        29.72619722,  24.58549713,  23.48516821,  20.77938954,
        29.19459189,  88.67676838,  79.72987328,  79.94866084,
        93.23005042,  18.88972581,  21.34122243,  20.6679223 ,
        28.88670381,  49.24144612,  46.14174741,  45.39631334,
        45.01218186,  73.76057586,  82.96880195,  71.84547684,
        69.85320595, 102.22421043,  93.78252358, 110.        ,
        36.52889673,  24.14234281,  44.84805372,  41.02504618,
        20.51976563,  18.765772  ,  17.9095202 ,  17.75442285,
        13.08832041,  10.83266174,  14.00441945,  15.91328975,
        21.60597587,  18.8188451 ,  21.15311208,  24.14234281,
        20.63083317,  76.05635059,  80.05816704,  71.18975117,
        70.98330444,  56.13992036,  89.36985382,  84.38736544,
        82.6716892 ,  81.4149056 ,  22.60363518,  63.06844313,
        69.92143863,  76.76982089,  69.2066568 ,  35.81711267,
        26.25184749,  36.94940537,  19.95069229,  23.88237331,
        21.79608472,  26.1474042 ,  19.49759118,  18.36136808,
        69.98970461,  56.13992036,  66.21810474,  68.02351436,
        59.39644014, 102.10046481,  82.96880195,  79.25686195,
        74.74521151,  93.34830013, 102.05923292,  60.7883734 ,
        40.55589449,  44.7388015 ,  36.11079464,  37.9986264 ,
        35.11233175,  15.83199594, 103.96451839, 100.21241654,
        90.18186347,  84.27493641,  32.38645967,  21.62494928,
        24.00218436,  23.56434276,  18.78345471,  22.21725537,
        25.44271071,  21.36007926,  69.37650986,  76.19877818,
        14.51292942,  19.38962134,  27.75740889,  34.24717407,
        48.10262495,  29.459795  ,  32.80584831,  55.89556844,
        40.06360581,  35.03982309,  46.33599903,  15.83199594,
        25.01226779,  14.03498009,  26.90404245,  59.52231336,
        54.92349014,  54.35035315,  71.39649768,  91.93424995,
        82.70879915,  89.56285636,  75.45251972,  20.50128352,
        16.04379287,  22.02531454,  11.32159874,  16.70430249,
        18.80114574,  18.50153068,  21.00322336,  25.79385418,
        23.80266582,  16.65430211,  44.35746794,  49.815853  ,
        49.04119063,  41.52318884,  90.72524338,  82.07906251,
        84.23747672,  90.29816462,  63.55551901,  63.23059357,
        57.92740995,  59.64831981,  38.45278922,  43.19643409,
        41.81296121,  19.62393488,  28.99647648,  35.35456858,
        27.97283229,  30.39744886,  20.57526193,  26.96758278,
        37.07354237,  15.62160631,  42.92863291,  30.21771564,
        36.40567571,  36.11079464,  29.70395123,  13.41514444,
        25.27829944,  20.51976563,  27.54281821,  21.17188565,
        20.18836167,  73.97101962,  73.09614831,  65.35749368,
        73.97101962,  43.51889468,  46.80945169,  37.77255674,
        39.6256851 ,  17.24230306,  19.49759118,  15.62160631,
        13.41514444,  55.49963323,  53.18333207,  55.31736854,
        42.44868923,  13.86730874,  16.48817545,  19.33574884,
        27.3931002 ,  41.31307817,  64.63368105,  44.52069676,
        35.74387954,  60.75655952,  79.87569835,  68.46177648,
        62.35745431,  58.70651902,  17.41217694,  19.33574884,
        13.86730874,  22.02531454,  15.75091031,  62.68013142,
        68.63071356,  71.36201911,  76.80558184,  51.58836621,
        48.84134317,  54.86301837,  51.73502816,  74.14661842,
        72.22648148,  77.88228247,  78.24284811,  15.67003285,
        31.25845963,  21.36007926,  31.60164234,  17.51450098,
        17.92679488,  16.40542438,  19.96892459,  32.99310928,
        28.14577056,  30.80379718,  16.40542438,  13.48998471,
        16.40542438,  17.84050478,  13.48998471,  47.1451025 ,
        58.08281541,  53.06435374,  52.02897659,  41.44433489,
        36.60292926,  30.80379718,  48.98404972,  42.90189859,
        47.56635225,  39.24128299,  54.56115914,  48.41447259,
        48.84134317,  49.41341845,  42.76835191,  69.30854366,
        19.33574884,  27.28640858,  22.02531454,  20.70504474,
        26.33555201,  31.37264569,  33.93740821,  24.08222494,
        33.34566004,  41.05118927,  32.52595611,  48.41447259,
        16.48817545,  18.97851406,  43.84255439,  37.22278157,
        34.77459916,  44.38465193,  47.00510227,  61.39441929,
        57.77221268,  65.12675249,  61.07507305,  79.14790534,
        68.42801405,  54.10993164,  64.63368105,  15.42864956,
        16.24054679,  15.26876826,  29.68171358,  51.88189829,
        63.32798377,  42.36896092,  48.6988448 ,  20.15170555,
        19.24612787,  16.98905358,  18.88972581,  29.68171358,
        28.03762169,  30.35246559,  27.20120517,  19.13885751,
        16.12562794,  18.71277385,  16.9722369 ,  29.85984799,
        34.29495526,  37.54716158,  47.59450219,  19.93246832,
        30.60028577,  26.90404245,  24.66650366,  21.36007926,
        18.5366546 ,  32.64243213,  18.5366546 ,  18.09999962,
        22.70075058,  36.23351603,  43.97776651,  14.24983724,
        19.15671509,  14.17291518,  35.25757392,  24.38356372,
        26.02234705,  21.83420642,  25.81458463,  28.90864169,
        28.58044785,  30.91715052,  23.6833544 ,  12.82391671,
        14.63757021,  12.89709155,  17.75442285,  16.24054679,
        17.49742615,  16.40542438,  20.42743834,  17.41217694,
        23.58415722,  19.96892459,  20.33531923,  22.99334585,
        28.47146626,  28.90864169,  43.43816712,  41.57579979,
        35.01567018,  35.74387954,  48.5565546 ,  57.77221268,
        38.98605581,  49.98882458,  28.25412762,  29.01845599,
        23.88237331,  27.60710798,  26.54539622,  31.14448175,
        34.17556473,  16.3228815 ,  17.0732619 ,  16.15842026,
        18.80114574,  18.80114574,  19.42557798,  20.2434083 ,
        20.98452475,  16.07650192,  16.07650192,  16.57113469,
        36.11079464,  37.84783835,  27.82194848,  33.46359332,
        29.5706502 ,  23.38638738,  36.23351603,  32.40968826,
        18.88972581,  21.92965639,  28.68963762,  30.80379718])

In [None]:
# Generate a scatter plot
df.plot(kind='scatter', x='hp', y='mpg', s=sizes)

# Add the title
plt.title('Fuel efficiency vs Horse-power')

# Add the x-axis label
plt.xlabel('Horse-power')

# Add the y-axis label
plt.ylabel('Fuel efficiency (mpg)')

# Display the plot
plt.show()

### pandas box plots

While pandas can plot multiple columns of data in a single figure, making plots that share the same x and y axes, there are cases where two columns cannot be plotted together because their units do not match. The ```.plot()``` method can generate subplots for each column being plotted. Here, each plot will be scaled independently.

In this exercise your job is to generate box plots for ***fuel efficiency (mpg)*** and ***weight*** from the automobiles data set. To do this in a single figure, you'll specify ```subplots=True``` inside ```.plot()``` to generate two separate plots.

All necessary modules have been imported and the automobiles dataset is available in the workspace as ```df```.

#### Instructions

* Make a list called ***cols*** of the column names to be plotted: ***'weight'*** and ***'mpg'***.
* Call plot on ***df[cols]*** to generate a box plot of the two columns in a single figure. To do this, specify ***subplots=True***.

In [None]:
# Make a list of the column names to be plotted: cols
cols = ['weight', 'mpg']

# Generate the box plots
df[cols].plot(kind='box', subplots=True)

# Display the plot
plt.show()

### pandas hist, pdf and cdf

Pandas relies on the ```.hist()``` method to not only generate histograms, but also plots of probability density functions (PDFs) and cumulative distribution functions (CDFs).

In this exercise, you will work with a dataset consisting of restaurant bills that includes the amount customers tipped.

The original dataset is provided by the [Seaborn package](https://github.com/mwaskom/seaborn-data/blob/master/tips.csv).

Your job is to plot a PDF and CDF for the fraction column of the tips dataset. This column records what ```fraction``` of the total bill the tip makes up.

Remember, when plotting the PDF, you need to specify ```density=True``` in your call to ```.hist()``` (older Matplotlib versions called this argument ```normed```), and when plotting the CDF, you need to specify ```cumulative=True``` in addition to ```density=True```.
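
To see what these two options do numerically, the following sketch uses ```np.histogram``` on synthetic data (the ```sample``` array is invented, not the tips data). With ```density=True``` the bar areas sum to 1, and the running total of those areas climbs to 1, which is the shape a cumulative histogram displays.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.uniform(0.05, 0.25, size=500)  # made-up tip fractions

# density=True rescales the counts so that the total bar area equals 1
heights, edges = np.histogram(sample, bins=30, range=(0, 0.3), density=True)
widths = np.diff(edges)
areas = heights * widths

# The cumulative sum of the bar areas gives the empirical CDF at each right bin edge
cdf = np.cumsum(areas)

print(areas.sum())
print(cdf[-1])
```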

All necessary modules have been imported and the tips dataset is available in the workspace as ```df```. Also, some formatting code has been written so that the plots you generate will appear on separate rows.

#### Instructions

* Plot a PDF for the values in ***fraction*** with 30 ***bins*** between 0 and 30%. The range has been taken care of for you. ***ax=axes[0]*** means that this plot will appear in the first row.
* Plot a CDF for the values in ***fraction*** with 30 ***bins*** between 0 and 30%. Again, the range has been specified for you. To make the CDF appear on the second row, you need to specify ***ax=axes[1]***.

In [None]:
data_file = 'DataCamp-master/11-pandas-foundations/_datasets/tips.csv'
df = pd.read_csv(data_file)
df.head()

In [None]:
# This formats the plots such that they appear on separate rows
fig, axes = plt.subplots(nrows=2, ncols=1)

# Plot the PDF
df.fraction.plot(ax=axes[0], kind='hist', bins=30, density=True, range=(0,.3))

# Plot the CDF
df.fraction.plot(ax=axes[1], kind='hist', bins=30, density=True, cumulative=True, range=(0,.3))

### Statistical Exploratory Data Analysis

#### Summarizing with describe()

***Describe***
* count: number of non-null entries
* mean: average of entries
* std: standard deviation
* min: minimum entry
* 25%: first quartile
* 50%: median or second quartile
* 75%: third quartile
* max: maximum entry
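
As a small illustration of these fields, here is a sketch on a made-up Series (not the iris data) that includes one missing value:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, None])  # toy data with one missing value
summary = s.describe()

# count ignores the missing value; 50% is the median of the four remaining values
print(summary['count'], summary['50%'])
print(summary['min'], summary['max'])
```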

In [None]:
iris.describe()  # summary statistics

#### Counts

In [None]:
iris['sepal length (cm)'].count()  # Applied to Series

In [None]:
iris['sepal width (cm)'].count()  # Applied to Series

In [None]:
iris[['petal length (cm)', 'petal width (cm)']].count()  # Applied to DataFrame

In [None]:
type(iris[['petal length (cm)', 'petal width (cm)']].count())  # Returns series

#### Averages

* measures the tendency to a central value of a measurement

In [None]:
iris['sepal length (cm)'].mean()  # Applied to Series

In [None]:
iris.mean(numeric_only=True)  # Applied to entire DataFrame (numeric columns only)

#### Standard Deviations (std)

* measures spread of a measurement

In [None]:
iris.std(numeric_only=True)

#### Mean and Standard Deviation on a Bell Curve

In [None]:
iris['sepal width (cm)'].plot(kind='hist', bins=30)

#### Medians

* middle number of the measurements
* special example of a quantile

In [None]:
iris.median(numeric_only=True)

#### Quantile

* If q is between 0 and 1, the qth quantile of a dataset is a numerical value that splits the data into two sets
    * one with the fraction q of smaller observations
    * one with the fraction 1 - q of larger observations
* Quantiles expressed as percentages are called percentiles
* Median is the 0.5 quantile or the 50th percentile of a dataset
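
A quick check of the last point, sketched on a toy Series rather than the iris data:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])  # toy data

# The 0.5 quantile equals the median
print(s.quantile(0.5), s.median())

# Passing a list of quantiles returns a Series indexed by q
quartiles = s.quantile([0.25, 0.75])
print(quartiles)
```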

In [None]:
q = 0.5
iris.quantile(q, numeric_only=True)

#### Inter-quartile range (IQR)

In [None]:
q = [0.25, 0.75]
iris.quantile(q, numeric_only=True)
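
The cell above returns the two quartiles; the IQR itself is their difference. A minimal sketch on a toy Series (the ```iris``` DataFrame is not used here so that the snippet runs on its own):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])  # toy data

# First and third quartiles, then their difference
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
print(q1, q3, iqr)
```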

#### Range

* interval between the smallest and largest observations
* given by the min and max methods

In [None]:
iris.min()

In [None]:
iris.max()

#### Box Plots

In [None]:
iris.plot(kind='box')
plt.ylabel('[cm]')

### Exercises

#### Fuel efficiency

From the automobiles data set, which value corresponds to the median value of the ```'mpg'``` column? Your job is to select the ```'mpg'``` column and call the ```.median()``` method on it. The automobile DataFrame has been provided as ```df```.

In [None]:
data_file = 'DataCamp-master/11-pandas-foundations/_datasets/auto-mpg.csv'
df = pd.read_csv(data_file)
df.head()

In [None]:
# Select the 'mpg' column and compute its median
df['mpg'].median()

#### Bachelor's degrees awarded to women
In this exercise, you will investigate statistics of the percentage of Bachelor's degrees awarded to women from 1970 to 2011. Data is recorded every year for 17 different fields. This data set was obtained from the [Digest of Education Statistics](http://nces.ed.gov/programs/digest/2013menu_tables.asp).

Your job is to compute the minimum and maximum values of the ```'Engineering'``` column and generate a line plot of the mean value of all 17 academic fields per year. To perform this step, you'll use the ```.mean()``` method with the keyword argument ```axis='columns'```. This computes the mean across all columns per row.
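
To make the direction of ```axis='columns'``` concrete, here is a sketch on a tiny made-up DataFrame (the column names ```A``` and ```B``` and their values are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({'A': [10.0, 20.0], 'B': [30.0, 40.0]},
                   index=pd.Index([1970, 1971], name='Year'))

# axis='columns' averages across the columns, producing one value per row (year)
row_means = toy.mean(axis='columns')
print(row_means)
```

The result is a Series indexed by ```Year```, which is why it can be plotted directly against the year.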

The DataFrame has been pre-loaded for you as ```df``` with the index set to ```'Year'```.

***Instructions***

* Print the minimum value of the ***'Engineering'*** column.
* Print the maximum value of the ***'Engineering'*** column.
* Construct the mean percentage per year with ***.mean(axis='columns')***. Assign the result to ***mean***.
* Plot the average percentage per year. Since ***'Year'*** is the index of ***df***, it will appear on the x-axis of the plot. No keyword arguments are needed in your call to ***.plot()***.


In [None]:
data_file = 'DataCamp-master/11-pandas-foundations/_datasets/percent-bachelors-degrees-women-usa.csv'
df = pd.read_csv(data_file, index_col='Year')
df.head()

In [None]:
# Print the minimum value of the Engineering column
df.Engineering.min()

In [None]:
# Print the maximum value of the Engineering column
df.Engineering.max()

In [None]:
# Construct the mean percentage per year: mean
mean = df.mean(axis='columns')
mean.head()

In [None]:
# Plot the average percentage per year
mean.plot()

#### Median vs mean

In many data sets, there can be large differences in the mean and median value due to the presence of outliers.
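
A minimal sketch of the effect, using made-up values with a single outlier (not the Titanic fares):

```python
import pandas as pd

fares = pd.Series([7, 8, 8, 9, 10, 500])  # made-up values with one outlier

print(fares.mean())    # pulled far upward by the outlier
print(fares.median())  # stays near the bulk of the data
```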

In this exercise, you'll investigate the mean, median, and max fare prices paid by passengers on the Titanic and generate a box plot of the fare prices. This data set was obtained from [Vanderbilt University](http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.html).

All necessary modules have been imported and the DataFrame is available in the workspace as ```df```.

***Instructions***

* Print summary statistics of the ***'fare'*** column of ***df*** with ***.describe()*** and ***print()***. Note: ***df.fare*** and ***df['fare']*** are equivalent.
* Generate a box plot of the ***'fare'*** column.

In [None]:
data_file = 'DataCamp-master/11-pandas-foundations/_datasets/titanic.csv'
df = pd.read_csv(data_file)
df.head(3)

In [None]:
df.fare.describe()

In [None]:
df.fare.plot(kind='box')

#### Quantiles

In this exercise, you'll investigate the distribution of life expectancy in countries around the world. This dataset contains life expectancy for persons born each year from 1800 to 2015. Since country names change or results are not reported, not every country has values. This dataset was obtained from [Gapminder](https://docs.google.com/a/continuum.io/spreadsheets/d/1dgOdlUEq6_V55OHZCxz5BG_0uoghJTeA6f83br5peNs/pub?range=A1:D70&gid=1&output=html#).

First, you will determine the number of countries reported in 2015. There are a total of 260 unique countries in the entire dataset. Then, you will compute the 5th and 95th percentiles of life expectancy over the entire dataset. Finally, you will make a box plot of life expectancy every 50 years from 1800 to 2000. Notice the large change in the distributions over this period.

The dataset has been pre-loaded into a DataFrame called ```df```.

***Instructions***

* Print the number of countries reported in 2015. To do this, use the ***.count()*** method on the ***'2015'*** column of ***df***.
* Print the 5th and 95th percentiles of ***df***. To do this, use the ***.quantile()*** method with the list ***[0.05, 0.95]***.
* Generate a box plot using the list of columns provided in ***years***.

In [None]:
data_file = 'DataCamp-master/11-pandas-foundations/_datasets/life_expectancy_at_birth.csv'
df = pd.read_csv(data_file)
df.head(3)

In [None]:
# Print the number of countries reported in 2015
df['2015'].count()

In [None]:
# Print the 5th and 95th percentiles
df.quantile([0.05, 0.95], numeric_only=True)

In [None]:
# Generate a box plot
years = ['1800','1850','1900','1950','2000']
df[years].plot(kind='box')

#### Standard deviation of temperature

Let's use the mean and standard deviation to explore differences in temperature distributions in Pittsburgh in 2013. The data has been obtained from [Weather Underground](https://www.wunderground.com/history/).

In this exercise, you're going to compare the distribution of daily temperatures in January and March. You'll compute the mean and standard deviation for these two months. You will notice that while the mean values are similar, the standard deviations are quite different, meaning that one month had a larger fluctuation in temperature than the other.

The DataFrames have been pre-loaded for you as ```january```, which contains the January data, and ```march```, which contains the March data.

***Instructions***

* Compute and print the means of the January and March data using the ***.mean()*** method.
* Compute and print the standard deviations of the January and March data using the ***.std()*** method.

In [None]:
jan_values = np.array([['2013-01-01', 28],
                       ['2013-01-02', 21],
                       ['2013-01-03', 24],
                       ['2013-01-04', 28],
                       ['2013-01-05', 30],
                       ['2013-01-06', 34],
                       ['2013-01-07', 29],
                       ['2013-01-08', 31],
                       ['2013-01-09', 36],
                       ['2013-01-10', 34],
                       ['2013-01-11', 47],
                       ['2013-01-12', 55],
                       ['2013-01-13', 62],
                       ['2013-01-14', 44],
                       ['2013-01-15', 30],
                       ['2013-01-16', 32],
                       ['2013-01-17', 32],
                       ['2013-01-18', 24],
                       ['2013-01-19', 42],
                       ['2013-01-20', 35],
                       ['2013-01-21', 18],
                       ['2013-01-22', 9],
                       ['2013-01-23', 11],
                       ['2013-01-24', 16],
                       ['2013-01-25', 16],
                       ['2013-01-26', 23],
                       ['2013-01-27', 23],
                       ['2013-01-28', 40],
                       ['2013-01-29', 59],
                       ['2013-01-30', 58],
                       ['2013-01-31', 32]]).transpose()
cols = ['Date', 'Temperature']
jan_zip = list(zip(cols, jan_values))
jan_dict = dict(jan_zip)
january = pd.DataFrame.from_dict(jan_dict).astype({'Temperature': np.int64})
january.head()

In [None]:
mar_values = np.array([['2013-03-01', 28],
                       ['2013-03-02', 26],
                       ['2013-03-03', 24],
                       ['2013-03-04', 28],
                       ['2013-03-05', 32],
                       ['2013-03-06', 34],
                       ['2013-03-07', 36],
                       ['2013-03-08', 32],
                       ['2013-03-09', 40],
                       ['2013-03-10', 55],
                       ['2013-03-11', 55],
                       ['2013-03-12', 40],
                       ['2013-03-13', 32],
                       ['2013-03-14', 30],
                       ['2013-03-15', 38],
                       ['2013-03-16', 36],
                       ['2013-03-17', 32],
                       ['2013-03-18', 34],
                       ['2013-03-19', 36],
                       ['2013-03-20', 32],
                       ['2013-03-21', 22],
                       ['2013-03-22', 28],
                       ['2013-03-23', 34],
                       ['2013-03-24', 34],
                       ['2013-03-25', 32],
                       ['2013-03-26', 34],
                       ['2013-03-27', 34],
                       ['2013-03-28', 37],
                       ['2013-03-29', 43],
                       ['2013-03-30', 43],
                       ['2013-03-31', 44]]).transpose()
mar_zip = list(zip(cols, mar_values))
mar_dict = dict(mar_zip)
march = pd.DataFrame.from_dict(mar_dict).astype({'Temperature': np.int64})
march.head()

In [None]:
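# Print the mean of the January and March data
# numeric_only=True skips the string-valued 'Date' column
january.mean(numeric_only=True)

In [None]:
march.mean(numeric_only=True)

In [None]:
# Print the standard deviation of the January and March data
january.std(numeric_only=True)

In [None]:
march.std(numeric_only=True)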

### Separating Populations with Boolean Indexing

#### Describe species column

* contains categorical data
* count: number of non-null entries
* unique: number of distinct values
* top: most frequent category
* freq: number of occurrences of the top value
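
The same four fields can be seen on a small made-up Series of labels (hypothetical values, not the iris species):

```python
import pandas as pd

labels = pd.Series(['a', 'b', 'a', 'c', 'a', None])  # toy categorical data
summary = labels.describe()

# count skips the missing entry; top and freq describe the most common label
print(summary)
```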

In [None]:
iris.species.describe()

#### Unique and Factors

In [None]:
iris.species.unique()

#### Filtering by species

In [None]:
indices = iris['species'] == 'setosa'
setosa = iris.loc[indices,:]  # extract new DataFrame

indices = iris['species'] == 'versicolor'
versicolor = iris.loc[indices,:]  # extract new DataFrame

indices = iris['species'] == 'virginica'
virginica = iris.loc[indices,:]  # extract new DataFrame

#### Checking species

In [None]:
setosa['species'].unique()

In [None]:
versicolor['species'].unique()

In [None]:
virginica['species'].unique()

In [None]:
setosa.head(2)

In [None]:
versicolor.head(2)

In [None]:
virginica.head(2)

#### Visual EDA: All Data

In [None]:
iris.plot(kind='hist',
          bins=50,
          range=(0, 8),
          alpha=0.3)
plt.title('Entire Iris Dataset')
plt.xlabel('[cm]')

#### Visual EDA: Individual Factors

In [None]:
setosa.plot(kind='hist',
          bins=50,
          range=(0, 8),
          alpha=0.3)
plt.title('Setosa Dataset')
plt.xlabel('[cm]')

versicolor.plot(kind='hist',
          bins=50,
          range=(0, 8),
          alpha=0.3)
plt.title('Versicolor Dataset')
plt.xlabel('[cm]')

virginica.plot(kind='hist',
          bins=50,
          range=(0, 8),
          alpha=0.3)
plt.title('Virginica Dataset')
plt.xlabel('[cm]')

#### Statistical EDA: describe()

In [None]:
describe_all = iris.describe()
describe_all

In [None]:
describe_setosa = setosa.describe()
describe_setosa

In [None]:
describe_versicolor = versicolor.describe()
describe_versicolor

In [None]:
describe_virginica = virginica.describe()
describe_virginica

#### Computing Errors

* The percentage error is the absolute difference between a statistic computed within its own group (the correct value) and the same statistic computed over the whole population, divided by the within-group value
* Elementwise arithmetic so no need for loops
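
The arithmetic can be sketched on a single statistic with a tiny made-up dataset (the ```group``` and ```value``` columns are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({'group': ['x', 'x', 'y', 'y'],
                    'value': [1.0, 3.0, 5.0, 7.0]})

mean_all = toy['value'].mean()                         # mean over the whole population
mean_x = toy.loc[toy['group'] == 'x', 'value'].mean()  # mean within group x

# Percentage error of using the population mean in place of the group mean
error_x = 100 * abs(mean_x - mean_all) / mean_x
print(error_x)
```

Applying the same expression to whole ```describe()``` tables works because pandas aligns the two tables on their row and column labels and operates element by element.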

In [None]:
error_setosa = 100 * np.abs(describe_setosa - describe_all)
error_setosa = error_setosa / describe_setosa
error_setosa

In [None]:
error_versicolor = 100 * np.abs(describe_versicolor - describe_all)
error_versicolor = error_versicolor / describe_versicolor
error_versicolor

In [None]:
error_virginica = 100 * np.abs(describe_virginica - describe_all)
error_virginica = error_virginica / describe_virginica
error_virginica

## Time Series in pandas

### Indexing pandas time series

## Case Study - Sunlight in Austin

### Reading and Cleaning the Data