In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from numpy import nan as NaN  # the `NaN` alias was removed in NumPy 2.0
from glob import glob
import re

In [None]:
# Use fully qualified option names; the short forms are ambiguous in recent pandas
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 300)
pd.set_option('display.expand_frame_repr', True)

# Review of Pandas DataFrames

## Data ingestion & inspection

### pandas DataFrames

* Example: DataFrame of Apple Stock data

In [None]:
AAPL = pd.read_csv(r'DataCamp-master/11-pandas-foundations/_datasets/AAPL.csv',
                   index_col='Date', parse_dates=True)

In [None]:
AAPL.head()

* The rows are labeled by a special data structure called an Index.
    * Indexes in Pandas are tailored lists of labels that permit fast look-up and some powerful relational operations.
* The index labels in the AAPL DataFrame are dates in reverse chronological order.
* Labeled rows & columns improve the clarity and intuition of many data analysis tasks.
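
To illustrate label-based look-up and label alignment on a date index, here is a minimal sketch. The frame `toy`, the Series `other`, and all dates and prices are made up for illustration; they are not taken from `AAPL.csv`.

```python
import pandas as pd

# Hypothetical prices, indexed by date in reverse chronological order
toy = pd.DataFrame(
    {'Close': [103.0, 102.0, 101.0]},
    index=pd.to_datetime(['2020-01-03', '2020-01-02', '2020-01-01']),
)

# Look up a value by its index label rather than by its position
close_jan2 = toy.loc[pd.Timestamp('2020-01-02'), 'Close']

# Arithmetic between labeled objects aligns rows by label, not by position
other = pd.Series([1.0, 2.0],
                  index=pd.to_datetime(['2020-01-01', '2020-01-02']))
aligned = toy['Close'] + other  # 2020-01-03 has no partner, so it becomes NaN
```

The alignment step is one example of the relational behavior an Index provides: the two objects are in different orders and have different lengths, yet matching dates are still paired correctly.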

In [None]:
type(AAPL)

In [None]:
AAPL.shape

In [None]:
AAPL.columns

In [None]:
type(AAPL.columns)

In [None]:
AAPL.index

In [None]:
type(AAPL.index)

* DataFrames can be sliced like NumPy arrays or Python lists using colons to specify the start, end and stride of a slice.
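
As a quick sketch of start, end, and stride with `.iloc`, the example below uses a small made-up frame (`toy` is hypothetical) so that each slice can be checked by eye.

```python
import pandas as pd

# Hypothetical 10-row frame: column 'a' holds 0..9, column 'b' holds 10..19
toy = pd.DataFrame({'a': range(10), 'b': range(10, 20)})

first_three = toy.iloc[:3, :]    # rows 0, 1, 2 (the end position is exclusive)
every_other = toy.iloc[::2, :]   # rows 0, 2, 4, 6, 8 (stride of 2)
last_two_b = toy.iloc[-2:, 1]    # last two rows of the second column
```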

In [None]:
# Start of the DataFrame to the 5th row, inclusive of all columns
AAPL.iloc[:5,:]

In [None]:
# Start at the 5th last row to the end of the DataFrame using a negative index
AAPL.iloc[-5:,:]

In [None]:
AAPL.head()

In [None]:
AAPL.tail()

In [None]:
AAPL.info()

In [None]:
AAPL.Close.plot(kind='line')

# Add first subplot
plt.subplot(2, 1, 1)
AAPL.Close.plot(kind='line')

# Add title and specify axis labels
plt.title('Close')
plt.ylabel('Value - $')
plt.xlabel('Year')

# Add second subplot
plt.subplot(2, 1, 2)
AAPL.Volume.plot(kind='line')

# Add title and specify axis labels
plt.title('Volume')
plt.ylabel('Number of Shares')
plt.xlabel('Year')

# Display the plots
plt.tight_layout()
plt.show()

### Broadcasting

* Assigning a scalar value to a column slice broadcasts that value to every selected row

In [None]:
AAPL.iloc[::3, -1] = np.nan  # every 3rd row of Volume is now NaN

In [None]:
AAPL.head(7)

In [None]:
AAPL.info()

* Note that Volume now has fewer non-null values
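
The effect of the broadcast can be confirmed by counting missing values directly. This is a minimal sketch on a made-up frame; `toy` and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical 7-row frame; 'volume' is the last column
toy = pd.DataFrame({'price': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
                    'volume': [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0]})

# One scalar is broadcast to every 3rd row of the last column
toy.iloc[::3, -1] = np.nan

n_missing = int(toy['volume'].isna().sum())  # rows 0, 3 and 6 are now NaN
n_present = int(toy['volume'].count())       # count() skips NaN
```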

### Series

In [None]:
low = AAPL.Low

In [None]:
type(low)

In [None]:
low.head()

In [None]:
lows = low.values

In [None]:
type(lows)

In [None]:
lows[0:5]

* A pandas Series, then, is a 1D labeled NumPy array, and a DataFrame is a 2D labeled array whose columns are Series
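
The relationship between a DataFrame, its columns, and the underlying arrays can be checked with a few `isinstance` calls. The frame `toy` below is a hypothetical stand-in for `AAPL`.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame with string row labels
toy = pd.DataFrame({'low': [1.5, 2.5], 'high': [3.5, 4.5]}, index=['x', 'y'])

col = toy['low']       # selecting one column returns a Series
arr = col.to_numpy()   # the unlabeled values as a NumPy array (like .values)

is_series = isinstance(col, pd.Series)
is_ndarray = isinstance(arr, np.ndarray)
shares_labels = col.index.equals(toy.index)  # the Series keeps the row labels
dims = (col.ndim, toy.ndim)                  # 1D Series, 2D DataFrame
```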

### Inspecting your data

You can use the DataFrame methods ```.head()``` and ```.tail()``` to view the first few and last few rows of a DataFrame. In this exercise, we have imported pandas as ```pd``` and loaded population data from 1960 to 2014 as a DataFrame ```df```. This dataset was obtained from the World Bank.

Your job is to use ```df.head()``` and ```df.tail()``` to verify that the first and last rows match a file on disk. In later exercises, you will see how to extract values from DataFrames with indexing, but for now, manually copy/paste or type values into assignment statements where needed. Select the correct answer for the first and last values in the ```'Year'``` and ```'Total Population'``` columns.

### Instructions

Possible Answers
* First: 1980, 26183676.0; Last: 2000, 35.
* First: 1960, 92495902.0; Last: 2014, 15245855.0.
* First: 40.472, 2001; Last: 44.5, 1880.
* First: CSS, 104170.0; Last: USA, 95.203.

In [None]:
wb_df = pd.read_csv(r'DataCamp-master/11-pandas-foundations/_datasets/world_ind_pop_data.csv')

In [None]:
wb_df.head()

In [None]:
wb_df.tail()

### DataFrame data types

Pandas is aware of the data types in the columns of your DataFrame. It is also aware of null and ```NaN``` ('Not-a-Number') types which often indicate missing data. In this exercise, we have imported pandas as ```pd``` and read in the world population data which contains some ```NaN``` values, a value often used as a place-holder for missing or otherwise invalid data entries. Your job is to use ```df.info()``` to determine information about the total count of ```non-null``` entries and infer the total count of ```'null'``` entries, which likely indicates missing data. Select the best description of this data set from the following:

### Instructions

Possible Answers
* The data is all of type float64 and none of it is missing.
* The data is of mixed type, and 9914 of it is missing.
* The data is of mixed type, and 3460 float64s are missing.
* The data is all of type float64, and 3460 float64s are missing.

```python
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13374 entries, 0 to 13373
Data columns (total 5 columns):
CountryName                      13374 non-null object
CountryCode                      13374 non-null object
Year                             13374 non-null int64
Total Population                 9914 non-null float64
Urban population (% of total)    13374 non-null float64
dtypes: float64(2), int64(1), object(2)
memory usage: 522.5+ KB
```

In [None]:
wb_df.info()
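
The number of missing entries that `.info()` leaves you to infer can also be computed directly. The sketch below uses a small made-up frame (`toy` and its values are hypothetical); the same calls should apply to `wb_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing population values
toy = pd.DataFrame({'Year': [1960, 1970, 1980, 1990],
                    'Total Population': [1.0e6, np.nan, 1.2e6, np.nan]})

non_null = toy.count()                 # per-column non-null counts, as .info() reports
nulls = toy.isna().sum()               # per-column missing counts
nulls_inferred = len(toy) - non_null   # same numbers, inferred from the row total
```

Applied to the `.info()` output shown above, the same subtraction gives 13374 - 9914 = 3460 missing `Total Population` values.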

### NumPy and pandas working together
Pandas depends upon and interoperates with NumPy, the Python library for fast numeric array computations. For example, you can use the DataFrame attribute ```.values``` to represent a DataFrame ```df``` as a NumPy array. You can also pass pandas data structures to NumPy methods. In this exercise, we have imported pandas as ```pd``` and loaded world population data every 10 years since 1960 into the DataFrame ```df```. This dataset was derived from the one used in the previous exercise.

Your job is to extract the values and store them in an array using the attribute ```.values```. You'll then use those values as input into the NumPy ```np.log10()``` method to compute the base 10 logarithm of the population values. Finally, you will pass the entire pandas DataFrame into the same NumPy ```np.log10()``` method and compare the results.

### Instructions

* Import ```numpy``` using the standard alias ```np```.
* Assign the numerical values in the DataFrame ```df``` to an array ```np_vals``` using the attribute ```values```.
* Pass ```np_vals``` into the NumPy method ```log10()``` and store the results in ```np_vals_log10```.
* Pass the entire ```df``` DataFrame into the NumPy method ```log10()``` and store the results in ```df_log10```.
* Inspect the output of the ```print()``` code to see the ```type()``` of the variables that you created.

In [None]:
pop_df = pd.read_csv(r'DataCamp-master/11-pandas-foundations/_datasets/world_population.csv')

In [None]:
pop_df.info()

In [None]:
# Create array of DataFrame values: np_vals
np_vals = pop_df.values

In [None]:
np_vals

In [None]:
# Create new array of base 10 logarithm values: np_vals_log10
np_vals_log10 = np.log10(np_vals)

In [None]:
np_vals_log10

In [None]:
# Create new DataFrame by passing pop_df to np.log10(): pop_df_log10
pop_df_log10 = np.log10(pop_df)

In [None]:
pop_df_log10

In [None]:
# Print the type of the original and new data containers
for name, obj in [('np_vals', np_vals), ('np_vals_log10', np_vals_log10),
                  ('pop_df', pop_df), ('pop_df_log10', pop_df_log10)]:
    print(name, 'has type', type(obj))

As a data scientist, you'll frequently interact with NumPy arrays, pandas Series, and pandas DataFrames, and you'll leverage a variety of NumPy and pandas methods to perform your desired computations. Understanding how NumPy and pandas work together will prove to be very useful.
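
A compact way to see the interplay: the same NumPy function returns a different container depending on what it is given. This sketch uses a made-up frame (`toy` and its values are hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical frame with round population values so the logs are easy to read
toy = pd.DataFrame({'Year': [1960, 1970],
                    'Total Population': [1.0e9, 1.0e10]})

arr_log = np.log10(toy.values)  # ndarray in, ndarray out (labels are lost)
df_log = np.log10(toy)          # DataFrame in, DataFrame out (labels are kept)

same_numbers = np.allclose(arr_log, df_log.values)
```

Passing the DataFrame itself is usually the more convenient choice when the row and column labels are still needed afterwards.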

### Building DataFrames from Scratch

* DataFrames read in from CSV
```python
pd.read_csv()
```

* DataFrames from dict (1)

In [None]:
data = {'weekday': ['Sun', 'Sun', 'Mon', 'Mon'],
        'city': ['Austin', 'Dallas', 'Austin', 'Dallas'],
        'visitors': [139, 237, 326, 456],
        'signups': [7, 12, 3, 5]}

In [None]:
users = pd.DataFrame(data)

In [None]:
users

* DataFrames from dict (2)
    * lists

In [None]:
cities = ['Austin', 'Dallas', 'Austin', 'Dallas']
signups = [7, 12, 3, 5]
weekdays = ['Sun', 'Sun', 'Mon', 'Mon']
visitors = [139, 237, 326, 456]

list_labels = ['city', 'signups', 'visitors', 'weekday']
list_cols = [cities, signups, visitors, weekdays]  # list of lists

zipped = list(zip(list_labels, list_cols))  # tuples
zipped

* DataFrames from dict (3)

In [None]:
data2 = dict(zipped)

In [None]:
users2 = pd.DataFrame(data2)

In [None]:
users2

### Broadcasting

* Saves time by generating long lists, arrays or columns without loops

In [None]:
users['fees'] = 0  # Broadcasts value to entire column

In [None]:
users

### Broadcasting with a dict

In [None]:
heights = [59.0, 65.2, 62.9, 65.4, 63.7, 65.7, 64.1]

In [None]:
data = {'height': heights, 'sex': 'M'}  # M is broadcast to the entire column

In [None]:
results = pd.DataFrame(data)

In [None]:
results

### Index and columns

* We can assign a list of strings to the `columns` and `index` attributes, as long as the list has as many entries as there are columns or rows.

In [None]:
results.columns = ['height (in)', 'sex']

In [None]:
results.index = ['A', 'B', 'C', 'D', 'E', 'F', 'G']

In [None]:
results
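
The "suitable length" requirement can be seen by assigning a list that is too short. This is a minimal sketch on a made-up three-row frame; `toy` and `mismatch_error` are hypothetical names.

```python
import pandas as pd

# Hypothetical frame: 3 rows, 2 columns ('M' is broadcast to every row)
toy = pd.DataFrame({'height': [59.0, 65.2, 62.9], 'sex': 'M'})

toy.columns = ['height (in)', 'sex']  # 2 labels for 2 columns: accepted
toy.index = ['A', 'B', 'C']           # 3 labels for 3 rows: accepted

try:
    toy.index = ['A', 'B']            # 2 labels for 3 rows
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc              # pandas rejects the wrong length
```

After the failed assignment, the index set in the previous step is still in place.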

### Zip lists to build a DataFrame

In this exercise, you're going to make a pandas DataFrame of the top three countries to win gold medals since 1896 by first building a dictionary. ```list_keys``` contains the column names ```'Country'``` and ```'Total'```. ```list_values``` contains the full names of each country and the number of gold medals awarded. The values have been taken from [Wikipedia](https://en.wikipedia.org/wiki/All-time_Olympic_Games_medal_table).

Your job is to use these lists to construct a list of tuples, use the list of tuples to construct a dictionary, and then use that dictionary to construct a DataFrame. In doing so, you'll make use of the ```list()```, ```zip()```, ```dict()``` and ```pd.DataFrame()``` functions. Pandas has already been imported as pd.

Note: The [zip()](https://docs.python.org/3/library/functions.html#zip) function in Python 3 and above returns a special zip object, which is an iterator that produces its tuples lazily. To convert this ```zip``` object into a list, you'll need to use ```list()```. You can learn more about the ```zip()``` function as well as generators in [Python Data Science Toolbox (Part 2)](https://www.datacamp.com/courses/python-data-science-toolbox-part-2).
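
A minimal sketch of why the ```list()``` conversion matters: a ```zip``` object can only be consumed once. The names `keys` and `values` and their contents are placeholders for illustration, not the real medal data.

```python
keys = ['Country', 'Total']
values = [['A', 'B', 'C'], [3, 2, 1]]  # placeholder values, not real medal counts

pairs = zip(keys, values)     # a lazy zip object, not a list
first_pass = list(pairs)      # consuming it yields the (key, value) tuples
second_pass = list(pairs)     # the iterator is now exhausted

data = dict(first_pass)       # the saved list can still be reused
```

Storing the result of ```list(zip(...))``` in a variable means it can be both printed and passed to ```dict()``` without being silently emptied in between.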

### Instructions

* Zip the 2 lists ```list_keys``` and ```list_values``` together into one list of (key, value) tuples. Be sure to convert the ```zip``` object into a list, and store the result in ```zipped```.
* Inspect the contents of ```zipped``` using ```print()```. This has been done for you.
* Construct a dictionary using ```zipped```. Store the result as ```data```.
* Construct a DataFrame using the dictionary. Store the result as ```df```.