<img src="pandas_white.svg" alt="pandas" width="500"/>

# DataFrame subsetting
Subsetting is the process of extracting data from a DataFrame by columns, by rows, or both.

## Selecting columns
To select a column in a DataFrame, put the column name in square brackets: `df['column_name']`.

In [None]:
import pandas as pd
import random
import numpy as np

In [None]:
df = pd.read_csv('name_apples_bookid.csv')
df

### Select a single column
Question: What dtype is the data within the name column: object, float, int, or category?

In [None]:
df['name']

In [None]:
df.name

Question: What type of object is the name column itself?

In [None]:
type(df['name'])

### Selecting multiple columns

In [None]:
df[['name', 'apples']]
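
The difference between single and double square brackets can be sketched with a small stand-in DataFrame. The `demo` data below is hypothetical (the contents of `name_apples_bookid.csv` are not shown in this notebook); only the bracket behaviour matters.

```python
import pandas as pd

# Hypothetical stand-in data, made up for illustration
demo = pd.DataFrame({
    'name': ['Molly', 'Matt', 'Larry'],
    'apples': [18, 12, 16],
    'book_id': [101, 102, 101],
})

one_col_series = demo['name']     # a single label returns a Series
one_col_frame = demo[['name']]    # a list of labels returns a DataFrame

print(type(one_col_series).__name__)
print(type(one_col_frame).__name__)
print(one_col_frame.shape)        # one column, same number of rows
```

A list with one label still gives a DataFrame, which is useful when later code expects DataFrame methods rather than Series methods.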

## Filtering rows
We can subset our DataFrame by filtering rows based on certain conditions.  A conditional expression is evaluated for every row and returns a Series of boolean values.  If that boolean Series is placed within the square brackets of a DataFrame, only the rows where the value is True will be returned.

### Filtering with a single conditional expression

In [None]:
# Conditional expression that returns True/False
# for every value in the apples column, depending on
# whether it is greater than or equal to 16
(df['apples'] >= 16)

In [None]:
df.apples >= 16

In [None]:
# If we place the conditional expression
# within a DataFrame square bracket then it will return
# all rows in which the conditional expression is True
df[(df['apples'] >= 16)]
df[(df.apples >= 16)]

### Filtering with multiple conditional expressions
When filtering with multiple conditions, each conditional expression must be enclosed in parentheses.  Additionally, the conditions must be joined by either an ampersand `&`, which represents "and", or a pipe `|`, which represents "or".

In [None]:
df[(df['apples'] >=16) & (df['name'] == 'Molly')]

In [None]:
df[(df['apples'] >=16) | (df['name'] == 'Larry')]
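
The parentheses matter because Python's `&` and `|` bind more tightly than comparison operators such as `>=` and `==`. One way to sidestep the issue, sketched here on hypothetical `demo` data, is to give each condition its own name first. The sketch also shows `~`, which negates a boolean Series.

```python
import pandas as pd

# Hypothetical stand-in data, made up for illustration
demo = pd.DataFrame({
    'name': ['Molly', 'Matt', 'Larry', 'Jill'],
    'apples': [18, 12, 16, 14],
})

# Each condition is a boolean Series with one value per row
has_16_plus = demo['apples'] >= 16
is_molly = demo['name'] == 'Molly'

both = demo[has_16_plus & is_molly]      # "and": both conditions hold
either = demo[has_16_plus | is_molly]    # "or": at least one holds
not_molly = demo[~is_molly]              # "not": flips True and False

print(both)
print(either)
print(not_molly)
```

Named masks also make long filters easier to read and reuse.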

In [None]:
older = (df['apples'] >=16)
older_books = df[older]['book_id']
older_books
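
Filtering rows and picking a column can also be done in a single step with `loc`, which takes the row selector first and the column label second. This is a minimal sketch on hypothetical `demo` data; it returns the same values as the two-step form used above.

```python
import pandas as pd

# Hypothetical stand-in data, made up for illustration
demo = pd.DataFrame({
    'name': ['Molly', 'Matt', 'Larry', 'Jill'],
    'apples': [18, 12, 16, 14],
    'book_id': [101, 102, 101, 103],
})

mask = demo['apples'] >= 16

two_step = demo[mask]['book_id']        # filter rows, then pick the column
one_step = demo.loc[mask, 'book_id']    # rows and column in one indexing call

print(one_step)
print(two_step.equals(one_step))
```

The one-step form is generally the safer choice when you later want to assign values to the selected cells.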

## Select rows by position or label - df.iloc and df.loc
Rows can also be extracted by integer position or by label.
DataFrames have two indexers for this purpose.  The first is `iloc`, in which the "i" stands for integer.  `iloc` uses the integer position of each row or column to extract it, while `loc` uses the row or column label.

In [None]:
# Change the index of the DataFrame to strings so that we can
# more easily demonstrate label-based indexing
df.index = ['teacher', 'author', 'student', 'TA']

### iloc on a single row

In [None]:
df.iloc[0]

### use row and column with iloc
Pass two values separated by a comma: the first is the row position and the second is the column position.

In [None]:
# select index 0 and column 0
df.iloc[0,0]

### iloc a range of rows

In [None]:
df.iloc[0:2]

### iloc a range of rows and columns

In [None]:
df.iloc[0:2, 0:2]

### loc on a single row

In [None]:
df.loc['teacher']

### Use row label and column label with loc

In [None]:
df.loc['student', 'name']

### Extract a range of rows with loc

In [None]:
df.loc['teacher':'TA']

### Extract a range of rows and columns with loc

In [None]:
df.loc['teacher':'TA', 'apples':'book_id']
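
One difference between the two indexers is easy to miss: an `iloc` slice excludes its end position, like ordinary Python slicing, while a `loc` slice includes its end label. A small sketch, using hypothetical `demo` data with the same index labels as above:

```python
import pandas as pd

# Hypothetical stand-in data, made up for illustration
demo = pd.DataFrame(
    {'name': ['Molly', 'Matt', 'Larry', 'Jill'],
     'apples': [18, 12, 16, 14]},
    index=['teacher', 'author', 'student', 'TA'],
)

by_position = demo.iloc[0:2]              # positions 0 and 1; position 2 is excluded
by_label = demo.loc['teacher':'author']   # both 'teacher' and 'author' are included

print(by_position.index.tolist())
print(by_label.index.tolist())
```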

## Add and remove rows and columns
To permanently remove a column or row from a DataFrame, you need to assign the transformed DataFrame to a variable or use the `inplace=True` parameter.  Without one of these, the original DataFrame is left unchanged.

### Add a column

In [None]:
df['money'] = [45000, 30000, 15000, np.nan]

### Add a row

In [None]:
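df.loc['Dean'] = ['Betty', 18, 101, 90000]
# DataFrame.append was removed in pandas 2.0, so use pd.concat;
# ignore_index=True replaces the row labels with 0, 1, 2, ...
new_row = pd.DataFrame([{'name':'Charles', 'apples':19,
                         'book_id':101, 'money': 120000}])
df = pd.concat([df, new_row], ignore_index=True)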

### Add an additional column for years of experience

In [None]:
experience = []
for _ in range(0, df.shape[0]): # Read about range here: https://docs.python.org/3/library/functions.html#func-range  
    years = random.randint(1,15) # Read about randint here: https://docs.python.org/3/library/random.html#random.randint
    experience.append(years)
    
df['experience'] = experience
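
The same column can be built without an explicit loop. The sketch below is an alternative, not what the cell above does: it uses NumPy's random generator on a hypothetical `demo` DataFrame, and the seed value is arbitrary. Note that `integers` excludes its upper bound, whereas `random.randint` includes it.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data, made up for illustration
demo = pd.DataFrame({'name': ['Molly', 'Matt', 'Larry', 'Jill']})

rng = np.random.default_rng(seed=0)   # arbitrary seed so the result is repeatable

# high=16 is exclusive, so values fall in 1..15 like random.randint(1, 15)
demo['experience'] = rng.integers(1, 16, size=len(demo))

print(demo)
```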

In [None]:
df.to_csv('salary.csv', index=False)

### Remove a row

In [None]:
df_no_matt = df.drop(1, axis=0) # inplace=True
df_no_matt

### Remove a column

In [None]:
df_no_apples = df.drop('apples', axis=1) # inplace=True
df_no_apples
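
To see that `drop` leaves the original untouched unless the result is assigned, here is a minimal sketch on hypothetical `demo` data. The `columns=` keyword is an equivalent spelling of `axis=1`.

```python
import pandas as pd

# Hypothetical stand-in data, made up for illustration
demo = pd.DataFrame({
    'name': ['Molly', 'Matt', 'Larry'],
    'apples': [18, 12, 16],
})

without_apples = demo.drop('apples', axis=1)   # same as demo.drop(columns='apples')
without_row_1 = demo.drop(1, axis=0)           # drops the row labeled 1

print(demo.columns.tolist())            # the original still has 'apples'
print(without_apples.columns.tolist())
print(without_row_1.index.tolist())
```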

# DataFrame Attributes
Like methods, attributes are specific to a data type.  Attributes are accessed with the same dot notation you use for methods, but without parentheses.  An attribute simply returns information stored by the object.  We will discuss just five of the many attributes of a DataFrame.  We discussed two previously, `shape` and `dtype(s)`.  The others are:
* columns
* index
* values

## Viewing and setting columns
In bigger datasets it is hard to see all the columns within a DataFrame at once.  You can access the column names with the `columns` attribute of the DataFrame.

### View columns

In [None]:
df.columns

### Set columns

In [None]:
df.columns = [1,2,3,4,5]

## Viewing and setting the index

### View index

In [None]:
df.index

### Set index

In [None]:
df.index = [1,3,5,7,9,10]

In [None]:
df

In [None]:
df.set_index(1)

### Display DataFrame and Series values

#### DataFrame

In [None]:
df.values

#### Series

In [None]:
df[1].values

# Pair Programming
The code above created a CSV file named "salary.csv".  It contains the yearly salary for different positions at Virtual University, the books (by id) they have read, and the number of years of experience they have.

1. Import the salary data into a pandas DataFrame
2. There is one book that has been read by several people.  Extract all rows with that book.
3. Add a new row and column to the DataFrame (just make up any data that you like)
4. Filter the data set for people who have less than 5 years of experience but make more than \\$30,000 (no matching rows is a valid answer)
5. Change index to 1,3,5,7,9,10,12 and filter and return the row with index 9

# DataFrame methods
Like other data types, DataFrames have methods attached to them.  These methods can modify or describe the entire content of the DataFrame.
* sample() - return a random sample of data
* nunique() - return the number of unique elements in each column
* isna() - determine if value within the DataFrame is `nan`
* dropna() - Remove row or column that contains `nan` value
* duplicated() - returns boolean series with True if value is duplicated
* drop_duplicates() - remove rows with duplicate values
* apply() - apply a function to values within the DataFrame

In [None]:
## sample()
df.sample(3)

In [None]:
# To determine the number of unique values in each column
df.nunique()

In [None]:
# Determine which values are missing (NaN)
df.isna()

In [None]:
# remove rows that contain a `nan` value (axis='columns' would drop columns instead)
df.dropna(axis='index')

In [None]:
# flag rows whose value in the column labeled 3 repeats an earlier row
df.duplicated(3)

In [None]:
# remove duplicate rows
df.drop_duplicates(3, keep='first')
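
The first argument to both methods is `subset`, the column or columns to compare. A small sketch on hypothetical `demo` data shows how `keep` changes which rows survive:

```python
import pandas as pd

# Hypothetical stand-in data, made up for illustration
demo = pd.DataFrame({
    'name': ['Molly', 'Matt', 'Larry', 'Jill'],
    'book_id': [101, 102, 101, 103],
})

# True for every row whose book_id already appeared in an earlier row
dup_mask = demo.duplicated(subset='book_id')

first_only = demo.drop_duplicates(subset='book_id', keep='first')  # keep first occurrence
none_kept = demo.drop_duplicates(subset='book_id', keep=False)     # drop every repeated value

print(dup_mask.tolist())
print(first_only['name'].tolist())
print(none_kept['name'].tolist())
```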

In [None]:
# Use apply() to apply a function to all the values within a DataFrame
numbers = pd.DataFrame({'A': [1,2,3], 'B': [10,20,30] })

def plus_10(x):
    return x+10

In [None]:
numbers

In [None]:
numbers.apply(plus_10)
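
By default `apply` hands each column to the function as a Series, so `plus_10` above works because adding 10 to a Series adds it to every element. The sketch below, using a hypothetical `nums` DataFrame with the same values, compares that with plain arithmetic and shows column-wise and row-wise functions.

```python
import pandas as pd

nums = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

def plus_10(x):
    return x + 10

via_apply = nums.apply(plus_10)    # function receives each column as a Series
via_arithmetic = nums + 10         # same result without apply

# A function that reduces each column to one value gives one result per column
col_range = nums.apply(lambda col: col.max() - col.min())

# axis=1 passes each row instead of each column
row_total = nums.apply(lambda row: row['A'] + row['B'], axis=1)

print(via_apply.equals(via_arithmetic))
print(col_range.tolist())
print(row_total.tolist())
```

For simple element-wise math, the arithmetic form is usually shorter and faster; `apply` is more useful when the function needs a whole column or row.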

# Summarize DataFrame
Examine the descriptive stats with the `describe` method.  
It provides the count, mean, min, max, and standard deviation of each numeric column.  
It also includes the 25th, 50th, and 75th percentiles.

In [None]:
df

In [None]:
df.describe()
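
When a DataFrame mixes numeric and text columns, `describe()` summarizes only the numeric ones by default. Passing `include='all'` adds the remaining columns. A minimal sketch on hypothetical `demo` data:

```python
import pandas as pd

# Hypothetical stand-in data, made up for illustration
demo = pd.DataFrame({
    'name': ['Molly', 'Matt', 'Larry', 'Jill'],
    'apples': [18, 12, 16, 14],
})

summary = demo.describe()                   # numeric columns only
summary_all = demo.describe(include='all')  # text columns get count/unique/top/freq

print(summary.index.tolist())
print(summary.loc['mean', 'apples'])
print(summary_all.columns.tolist())
```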

# DataFrame manipulation
There are DataFrame methods that allow you to manipulate its structure.
* df.sort_index() - sort the index or columns in alphabetical/numerical order
* df.reset_index() - change the index back to one that starts at zero and add the previous index as a new column
* df.sort_values() - sort all data based on a specific column
* df.rename() - change index or column labels

In [None]:
df.columns = ['name', 'apples', 'book_id', 'money', 'experience']

In [None]:
df = df.set_index('name')

In [None]:
df

## df.sort_index()
Sort labels across axis 0 or axis 1

In [None]:
df.sort_index(axis=0)

In [None]:
df.sort_index(axis=1)

## df.reset_index()
Reset the index so that the first row is index 0.  It also places the current index in a column unless you pass `drop=True`.

In [None]:
df = df.reset_index() # use drop=True to remove the name column

In [None]:
df

## df.sort_values()

In [None]:
df.sort_values('money')

## df.rename()
Change the index or column label name

In [None]:
df.rename({'name':'first_name', 'apples':'fruit', 'book_id':'isbn', 'money':'income'}, axis='columns')

In [None]:
df.rename({0:6, 3:8}, axis='index')

In [None]:
df.rename(str.upper, axis='columns')

# Reshape DataFrame
DataFrames can be restructured
* pd.melt() - transform wide data into long-format data by gathering columns into rows
* pd.concat() - append rows or columns of DataFrames

## pd.melt()

In [None]:
df.melt()

In [None]:
# You can use as many columns as you want for id_vars and value_vars
melted = df.melt(id_vars=['name', 'book_id'], value_vars=['apples', 'money', 'experience'])
melted
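
To make the wide-to-long reshaping concrete, here is a small sketch on a hypothetical `wide` DataFrame. The `var_name` and `value_name` parameters rename the two generated columns, which default to `variable` and `value`; the names `measure` and `amount` are chosen only for illustration.

```python
import pandas as pd

# Hypothetical stand-in data, made up for illustration
wide = pd.DataFrame({
    'name': ['Molly', 'Matt'],
    'apples': [18, 12],
    'money': [45000, 30000],
})

# Each (name, column) pair in the wide table becomes one row in the long table
long = wide.melt(id_vars='name', value_vars=['apples', 'money'],
                 var_name='measure', value_name='amount')

print(long)
```

Two rows with two value columns become four rows, one per person per measure.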

## pd.concat()
Pandas provides the ability to easily combine DataFrames.

In [None]:
df1 = pd.DataFrame({
    'name': ['Paul', 'Ryan', 'Ashley', 'Donald'],
    'math': [60,89,82,70],
    'physics': [66,95,83,66],
    'chemistry': [61,91,77,70]
})
df2 = pd.DataFrame({
    'name': ['Eddie', 'Frank', 'Greg', 'Hans'],
    'math': [66,95,83,66],
    'physics': [60,89,82,70],
    'chemistry': [90,81,78,90]
})

In [None]:
pd.concat([df1, df2])

In [None]:
pd.concat([df1,df2], axis=1)

In [None]:
# Add a hierarchical index with keys and names
res = pd.concat([df1, df2], keys=['Year 1', 'Year 2'], names=['Class', None])
res
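
Stacking rows with `pd.concat` keeps each DataFrame's original index, so labels can repeat. The sketch below, on two small hypothetical DataFrames `a` and `b`, shows the repeated labels and how `ignore_index=True` renumbers them.

```python
import pandas as pd

# Hypothetical stand-in data, made up for illustration
a = pd.DataFrame({'name': ['Paul', 'Ryan'], 'math': [60, 89]})
b = pd.DataFrame({'name': ['Eddie', 'Frank'], 'math': [66, 95]})

stacked = pd.concat([a, b])                        # index labels repeat
renumbered = pd.concat([a, b], ignore_index=True)  # fresh 0..n-1 index
side_by_side = pd.concat([a, b], axis=1)           # rows are matched by index label

print(stacked.index.tolist())
print(renumbered.index.tolist())
print(side_by_side.shape)
```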

# Series methods and attributes
There are also methods and attributes attached to Series.  We will cover some of the most used.
* value_counts() [docs](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html)
* astype() [docs](https://pandas.pydata.org/docs/reference/api/pandas.Series.astype.html)
* size (an attribute, so no parentheses) [docs](https://pandas.pydata.org/docs/reference/api/pandas.Series.size.html?highlight=size#pandas.Series.size)
* count() [docs](https://pandas.pydata.org/docs/reference/api/pandas.Series.count.html?highlight=count#pandas.Series.count)
* min() [docs](https://pandas.pydata.org/docs/reference/api/pandas.Series.min.html?highlight=min#pandas.Series.min)
* max() [docs](https://pandas.pydata.org/docs/reference/api/pandas.Series.max.html?highlight=max#pandas.Series.max)
* mean() [docs](https://pandas.pydata.org/docs/reference/api/pandas.Series.mean.html)
* median() [docs](https://pandas.pydata.org/docs/reference/api/pandas.Series.median.html)
* std() [docs](https://pandas.pydata.org/docs/reference/api/pandas.Series.std.html)
* isna() [docs](https://pandas.pydata.org/docs/reference/api/pandas.Series.isna.html)
* fillna() [docs](https://pandas.pydata.org/docs/reference/api/pandas.Series.fillna.html)
* describe() [docs](https://pandas.pydata.org/docs/reference/api/pandas.Series.describe.html)
* quantile() [docs](https://pandas.pydata.org/docs/reference/api/pandas.Series.quantile.html)
* unique() [docs](https://pandas.pydata.org/docs/reference/api/pandas.Series.unique.html)
* sample() [docs](https://pandas.pydata.org/docs/reference/api/pandas.Series.sample.html)
* cumsum() [docs](https://pandas.pydata.org/docs/reference/api/pandas.Series.cumsum.html)
* replace() [docs](https://pandas.pydata.org/docs/reference/api/pandas.Series.replace.html)

## Value counts
Counts the number of times a value appears in a column

In [None]:
df['book_id'].value_counts()

In [None]:
# Get each value's share of the total (a fraction between 0 and 1)
df['book_id'].value_counts(normalize=True)

In [None]:
# it doesn't count missing values
df['money'].value_counts()
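
The options above can be compared on a short hypothetical Series. Multiplying the normalized counts by 100 gives percentages, and `dropna=False` adds missing values to the tally.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data, made up for illustration
money = pd.Series([45000, 30000, 30000, np.nan])

counts = money.value_counts()                       # NaN is left out
with_missing = money.value_counts(dropna=False)     # NaN gets its own row
percent = money.value_counts(normalize=True) * 100  # fractions scaled to percentages

print(counts)
print(with_missing)
print(percent)
```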

## astype

In [None]:
salary = pd.read_csv('salary.csv')
salary['education'] = ['PHD', 'MS', 'BA', 'BA', 'MS', 'PHD']
salary['age'] = [34, 26, 22, 23, 28, 41]
salary['weight_lb'] = [179, 98, 139, 99, 220, 191]

In [None]:
salary.dtypes

In [None]:
salary['education'] = salary['education'].astype('category')

In [None]:
salary.dtypes
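
A plain `'category'` conversion has no notion of order. If the levels have a natural ranking, one option is an ordered `pd.CategoricalDtype`, sketched here on a hypothetical `edu` Series. The ranking BA < MS < PHD is an assumption made for this example.

```python
import pandas as pd

# Hypothetical stand-in data, made up for illustration
edu = pd.Series(['PHD', 'MS', 'BA', 'BA', 'MS', 'PHD'])

as_category = edu.astype('category')   # unordered; categories sorted alphabetically

# Assumed ranking for illustration: BA < MS < PHD
ordered_type = pd.CategoricalDtype(categories=['BA', 'MS', 'PHD'], ordered=True)
as_ordered = edu.astype(ordered_type)

at_least_ms = as_ordered >= 'MS'       # comparisons follow the declared order

print(as_category.cat.categories.tolist())
print(at_least_ms.tolist())
print(as_ordered.max())
```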

## Size
The number of values in the column including `nan`

In [None]:
salary.name.size

## Count
The number of values in the column excluding `nan`

In [None]:
salary.money.count()
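
The gap between `size` and `count()` is exactly the number of missing values, as this short sketch on a hypothetical Series shows:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data, made up for illustration
money = pd.Series([45000, 30000, 15000, np.nan])

total = money.size           # attribute: every element, NaN included
present = money.count()      # method: only non-missing elements
missing = money.isna().sum() # number of NaN values

print(total, present, missing)
```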

## Min, max, mean, median, std

In [None]:
salary.money.min()

In [None]:
salary.money.max()

In [None]:
salary.experience.mean()

In [None]:
salary

In [None]:
young = (salary.apples <= 16)
salary[young]['money'].mean()

In [None]:
salary.experience.median()

In [None]:
salary.money.std()

## isna(), fillna(), dropna()
The ability to detect and manage missing values is critical in data analysis

In [None]:
salary['money'].isna()

In [None]:
salary[salary['money'].isna()]

In [None]:
salary['money'].fillna(salary['money'].mean())

In [None]:
salary
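
`fillna` returns a new Series rather than changing the column, which is why the DataFrame displayed above still shows the missing value. To keep the filled values, assign the result back. A minimal sketch on hypothetical `demo` data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data, made up for illustration
demo = pd.DataFrame({'money': [45000.0, 30000.0, np.nan]})

filled = demo['money'].fillna(demo['money'].mean())  # mean() skips NaN by default
missing_before = demo['money'].isna().sum()          # original column is unchanged

demo['money'] = filled                               # assign back to keep the result
missing_after = demo['money'].isna().sum()

print(filled.tolist())
print(missing_before, missing_after)
```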

## describe()

In [None]:
salary.experience.describe()

## quantile()

In [None]:
salary.money.quantile()
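
Called with no arguments, `quantile()` uses `q=0.5`, which is the median. Passing a list returns one value per requested quantile. A short sketch on a hypothetical Series:

```python
import pandas as pd

# Hypothetical stand-in data, made up for illustration
money = pd.Series([10, 20, 30, 40, 50])

middle = money.quantile()                 # default q=0.5
quartiles = money.quantile([0.25, 0.75])  # Series indexed by the requested quantiles

print(middle)
print(quartiles)
```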

## unique()

In [None]:
salary.book_id.unique()

## sum() and cumsum()

In [None]:
sales = pd.read_excel('https://github.com/lwgray/mediumdata/raw/master/sample_pivot.xlsx',
                      engine='openpyxl', parse_dates=['Date'])
print(sales.shape)
sales.head()

In [None]:
sales.Units.sum()

In [None]:
sales.Sales.cumsum()

## sample()

In [None]:
sales.sample(500)

In [None]:
sales.to_csv('sales.csv')

## replace()

In [None]:
# 98 and 99 weights are codes for unknown
salary.weight_lb.replace([98,99], np.nan)
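
Like `fillna`, `replace` returns a new Series and leaves the original alone. Once the placeholder codes are turned into `NaN`, statistics such as the mean skip them. A minimal sketch on a hypothetical `weights` Series:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; 98 and 99 stand for "unknown"
weights = pd.Series([179, 98, 139, 99])

cleaned = weights.replace([98, 99], np.nan)  # codes become missing values

print(weights.mean())   # distorted by the placeholder codes
print(cleaned.mean())   # computed from the known weights only
```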

# Pair Programming
Congratulations!  After making it through Week 3, I wanted to reward you with an exciting challenge.  You have been given the title of Junior Data Analyst for a board/card game company called 'High-Five Games'.  Their most successful product is the fantasy card game 'Monsters under my bed'.  The company would like you to produce a report that converts the player purchasing data into meaningful insights.

Use your pandas skills to answer the following questions.
All data is contained within monster_trading_cards.csv

## Number of Players
* What is the total number of unique players (user_name)?

## Calculate Totals
* Number of Unique Monsters
* Average Purchase Price
* Total Number of Purchases
* Total Revenue

## Break Down the Gender Demographics
* % and number of males
* % and number of females
* % and number of non-disclosed

## Purchases by men
* Number of Purchases
* Average Purchase Price
* Total Purchase amount