<img src="pandas_white.svg" alt="pandas" width="500"/>
pandas is a fast, powerful, flexible, and easy-to-use open source data analysis and manipulation tool, built on top of the Python programming language. pandas aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.

## Highlights
* A fast and efficient DataFrame object for data manipulation with integrated indexing;

* Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;

* Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;

* Flexible reshaping and pivoting of data sets;

* Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;

* Columns can be inserted and deleted from data structures for size mutability;

* Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;

* High performance merging and joining of data sets;

* Time-series functionality: date range generation and frequency conversion;

* Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.

## Installation Instructions
Open a terminal (bash, Command Prompt, or PowerShell) and enter the following command:

`pip install pandas`

It is common practice to import pandas as `pd`:

In [None]:
import pandas as pd

## Pandas Data Types
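Pandas provides data types of its own (just as core Python provides list, dict, and str). The two main ones are:
1. DataFrame
2. Series

A third type, Panel, existed in older releases but was deprecated in pandas 0.20 and removed in 0.25.

We will be focusing on DataFrames and Series.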

## Anatomy of a DataFrame and Series

DataFrames are the primary pandas data structure. They are size-mutable and can contain heterogeneous tabular data. A DataFrame takes on the basic structure of a two-dimensional table with rows and columns, the same structure familiar from spreadsheet programs such as MS Excel.

The image below shows the anatomy of a DataFrame. Columns are identified by column labels that run across the top, and rows are identified by index labels that run from top to bottom along the left-hand side. The index is the way a DataFrame labels rows.

The index is also known as axis 0 (`index`), and the columns are also known as axis 1 (`columns`).
Missing values within a DataFrame are shown as `NaN` (Not a Number).

Truncated data is shown by an ellipsis (`...`): when there are many rows or columns, pandas by default reduces what is displayed on the screen.

A Series is a single labeled column, so a DataFrame can be thought of as a collection of Series that share the same index.
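
The parts described above can also be inspected directly in code. The following is a minimal sketch; the table contents and the row labels `r1` to `r3` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Small illustrative table; the names, values, and row labels are made up.
df = pd.DataFrame(
    {"name": ["Larry", "Matt", "Kass"], "apples": [14, np.nan, 16]},
    index=["r1", "r2", "r3"],
)

print(df.index)    # row labels (axis 0)
print(df.columns)  # column labels (axis 1)

# Selecting a single column returns a Series that shares the DataFrame's index.
apples = df["apples"]
print(type(apples))

# The missing value is stored as NaN; isna() flags it.
missing = df["apples"].isna()
print(missing)
```

Selecting one column with square brackets is the quickest way to see that each column of a DataFrame is itself a Series.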

<img src="dataframe_anatomy.png" alt="pandas" width="500"/>

## Creating DataFrames
Documentation: [Pandas.DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)

DataFrames can be defined in several different ways, either from Python data types or from files such as CSV or Excel. We demonstrate each way below.

### Creating a DataFrame with a list of lists

In [None]:
data_list = [
    ['Larry', 2, 3, 4],
    ['Matt', 6, 7, 8],
    ['Kass', 9, 10, 11],
    ['Ben', 12, 13, 14],
]
bubbles_n_balls = pd.DataFrame(data_list, columns=['name', 'bubbles', 'songs', 'balls'])

In [None]:
bubbles_n_balls

### Creating a DataFrame with a dictionary

In [None]:
data_dict = {
    'name': ['Larry', 'Matt', 'Kass', 'Molly'],
    'apples': [14, 15, 16, 17],
    'book_id': [123, 456, 789, 101],
}
name_apples_bookid = pd.DataFrame(data_dict)

In [None]:
name_apples_bookid

### Determine the dimensions of a DataFrame
You can determine the shape of a DataFrame (i.e. the number of rows and columns) with the `pd.DataFrame.shape` attribute. It returns a tuple of (number of rows, number of columns).

In [None]:
name_apples_bookid.shape

In [None]:
len(name_apples_bookid)  # number of rows in the DataFrame

### Getting a glimpse of the data
You can use the `pd.DataFrame.head()` and `pd.DataFrame.tail()` methods to get a glimpse of the first or last rows of a DataFrame, respectively. Both methods take an optional argument for the number of rows to show; it defaults to 5.

In [None]:
name_apples_bookid.head(2)

In [None]:
name_apples_bookid.tail(2)

### Creating a DataFrame from flat files
Documentation: [Type of flat files accepted by Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html)

Pandas can read data from flat files such as CSV, as well as from Excel workbooks. The readers are top-level pandas functions rather than DataFrame methods: `pd.read_csv` takes the path of a CSV file and returns a DataFrame, and `pd.read_excel` takes the path of an Excel file and returns a DataFrame.

Documentation: [pandas.read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)

Documentation: [pandas.read_excel](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html)
- requires the `openpyxl` package (`pip install openpyxl`)

## Reading data from Flat Files
Documentation: [pandas.read_table](https://pandas.pydata.org/docs/reference/api/pandas.read_table.html)

You should always manually inspect your data before reading it into pandas.  You should take note of how it is structured into columns and rows.
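
One way to do this inspection without leaving Python is to print the first few raw lines of the file. The sketch below writes a small made-up file (`sample.txt`, a hypothetical name) to a temporary directory so that it is self-contained; with a real file you would skip that step and open your own path.

```python
import os
import tempfile

import pandas as pd

# Write a tiny sample file so the sketch is self-contained (contents are made up).
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as f:
    f.write("name;apples\nLarry;14\nMatt;15\n")

# Peek at the first few raw lines to see the delimiter and whether a header row exists.
with open(path) as f:
    for _ in range(3):
        print(f.readline().rstrip())

# The peek shows ';' as the delimiter and a header row, so pass sep=';'.
sample = pd.read_csv(path, sep=";")
print(sample)
```

Seeing the raw lines first tells you which `sep` to pass and whether the first row holds column names.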

### Writing DataFrame to File
Documentation: [Pandas.DataFrame.to_csv](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html)

Documentation: [Pandas.DataFrame.to_excel](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_excel.html)
- requires the `openpyxl` package (`pip install openpyxl`)

Documentation: [Pandas.ExcelWriter](https://pandas.pydata.org/docs/reference/api/pandas.ExcelWriter.html)
- can create multiple sheets within a workbook
- requires an Excel engine package such as `openpyxl`

#### Write CSV file from DataFrame

In [None]:
name_apples_bookid.to_csv('name_apples_bookid.csv', index=False)

#### Create DataFrame from CSV file

In [None]:
csv = pd.read_csv('name_apples_bookid.csv')

In [None]:
print(type(csv))

#### Write and read an Excel file

In [None]:
name_apples_bookid.to_excel('name_apples_bookid.xlsx', engine='openpyxl', index=False)
excel = pd.read_excel('name_apples_bookid.xlsx', engine='openpyxl')
excel

### Read a table file
Always inspect your files before trying to import them.

In [None]:
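unames = ['category', 'book_id', 'book_name', 'rank', 'sales', 'type', 'sold_out', 'best_seller']
# A separator longer than one character is treated as a regular expression, which needs engine='python'.
# header=0 together with names replaces the file's own header row with the names in unames.
books = pd.read_table('book_categories.txt', sep='--', names=unames, engine='python', header=0)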

books

## Creating Series
Documentation: [Pandas.Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html)

The Series data type is the pandas representation of a column. It has methods that let you manipulate the data within the Series, and attributes that provide information about the state of the Series. We will discuss both methods and attributes later on. The values within a Series are generally all the same type, for instance strings, integers, or floats.
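
As a quick preview of the difference between attributes and methods, here is a minimal sketch. The Series contents are illustrative, and the particular attributes and methods shown are only a small sample of what a Series offers.

```python
import pandas as pd

animals = pd.Series(["monkey", "kangaroo", "horse", "cat"])

# Attributes describe the Series and are accessed without parentheses.
print(animals.dtype)
print(animals.shape)  # (4,)
print(animals.size)   # 4

# Methods are called with parentheses and return a new object.
upper = animals.str.upper()
ordered = animals.sort_values()
print(upper)
print(ordered)
```

Neither method call changes `animals` itself; each returns a new Series.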

### Creating Series with a list

In [None]:
animals_list = ['monkey','kangaroo','horse','cat']
series_list_1 = pd.Series(data=animals_list)
series_list_1

In [None]:
series_list_2 = pd.Series(animals_list, index=['a','b','c','d'])
series_list_2

### Creating Series with a dictionary

In [None]:
animals_dict = {'a':'monkey', 'b':'kangaroo', 'c':'horse', 'd':'cat'}
series_dict = pd.Series(data=animals_dict)
series_dict

### Series Operations

In [None]:
series_dict + '_tail'

In [None]:
series_one_to_ten = pd.Series([1,2,3,4,5,6,7,8,9,10, None])
print('Multiply \n', series_one_to_ten * 10, '\n\n')
print('floor division \n', series_one_to_ten // 2.5, '\n\n')
print('modulus \n', series_one_to_ten % 2, '\n\n')
print('greater_than \n', series_one_to_ten > 5, '\n\n')

### dtypes
Columns within a DataFrame (i.e. Series) can hold different data types. The dtype determines which methods are available to manipulate the column. A few of the common ones:
* `object`, which represents strings or mixed values (a column containing several types of values)
* `int64`, which represents integers
* `float64`, which represents floats
* `bool`, which represents Booleans
* `datetime64[ns]`, which represents date and time
* `category`, which represents a categorical value

The underlying types differ in how they handle missing values:
* float – The NumPy float type, which supports missing values
* int – The NumPy integer type, which does not support missing values
* 'Int64' – pandas nullable integer type
* object – The NumPy type for storing strings (and mixed types)
* 'category' – pandas categorical type, which does support missing values
* bool – The NumPy Boolean type, which does not support missing values (None becomes False, np.nan becomes True)
* 'boolean' – pandas nullable Boolean type
* datetime64[ns] – The NumPy date type, which does support missing values (NaT)
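
The effect of missing values on the dtype can be checked with a short experiment. This is a minimal sketch with made-up values; it only covers the float, nullable integer, and nullable Boolean cases from the list above.

```python
import pandas as pd

# A missing value turns what would be an integer column into float64 (the gap becomes NaN).
floats = pd.Series([1, 2, None])
print(floats.dtype)

# The nullable 'Int64' dtype keeps the integers and stores the gap as <NA>.
nullable_ints = pd.Series([1, 2, None], dtype="Int64")
print(nullable_ints.dtype)

# The nullable 'boolean' dtype likewise keeps the gap as <NA>.
nullable_bools = pd.Series([True, False, None], dtype="boolean")
print(nullable_bools.isna().tolist())
```

Note the capital `I` in `'Int64'`: it selects the pandas nullable type rather than the NumPy `int64` type.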

#### Access dtypes from DataFrame and Series

In [None]:
name_apples_bookid.dtypes

In [None]:
series_dict.dtype

# Pair Programming

# Week 3 - Pair Programming 
The data file `supermarket_sales.csv` for this exercise is located in Canvas under the `Files` tab, in the Module 2 folder, inside the Week3 folder.

#### Question 1
1. Read the `supermarket_sales.csv` file into Pandas and assign it to a variable named `df`

#### Question 2
2. Is `df` a Series or DataFrame?

#### Question 3
3. How many columns and rows are in `df`?

#### Question 4
4. How many columns of each dtype does `df` contain?

#### Question 5
5. Write `df` to an Excel file.