# Practical Pandas for Data Wrangling
## Objectives
- read data using Pandas
- explore and analyze the data
- apply transformations to the data

# [10] Basic Mechanics of Pandas
Let's get familiar with the basic mechanics of the Pandas library.

In [1]:
url = "https://s3.amazonaws.com/python-level-2/sales-funnel.csv"

In [2]:
import pandas as pd

In [5]:
df = pd.read_csv(url)

In [6]:
df.head()

Unnamed: 0,Account,Name,Rep,Manager,Product,Quantity,Price,Status
0,714466,Trantow-Barrows,Craig Booker,Debra Henley,CPU,1,30000,presented
1,714466,Trantow-Barrows,Craig Booker,Debra Henley,Software,1,10000,presented
2,714466,Trantow-Barrows,Craig Booker,Debra Henley,Maintenance,2,5000,pending
3,737550,"Fritsch, Russel and Anderson",Craig Booker,Debra Henley,CPU,1,35000,declined
4,146832,Kiehn-Spinka,Daniel Hilton,Debra Henley,CPU,2,65000,won


**Let's see what columns are available.**

**Let's examine the beginning (head) and end (tail) of the data set**

**Let's examine the data types in this data set**

**Let's check out descriptive statistics**

In [2]:
# df.describe(percentiles=[.1, .3, .5, .7, .9])

In [3]:
# df.describe(include='all')
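The inspection steps above can be sketched as follows. This is a minimal example on a small in-memory sample copied from the first rows shown earlier, so it runs without network access; the name `sample` is an assumption, and the notebook itself works on `df` loaded with `pd.read_csv(url)`.

```python
import pandas as pd

# Small sample copied from the first five rows shown above (not the full file).
sample = pd.DataFrame({
    "Account": [714466, 714466, 714466, 737550, 146832],
    "Name": ["Trantow-Barrows", "Trantow-Barrows", "Trantow-Barrows",
             "Fritsch, Russel and Anderson", "Kiehn-Spinka"],
    "Product": ["CPU", "Software", "Maintenance", "CPU", "CPU"],
    "Quantity": [1, 1, 2, 1, 2],
    "Price": [30000, 10000, 5000, 35000, 65000],
    "Status": ["presented", "presented", "pending", "declined", "won"],
})

print(sample.columns.tolist())  # column names
print(sample.head(2))           # first two rows
print(sample.tail(2))           # last two rows
print(sample.dtypes)            # one data type per column

# Descriptive statistics for the numeric columns, with custom percentiles.
numeric_stats = sample.describe(percentiles=[.1, .3, .5, .7, .9])

# include="all" also summarizes text columns (count, unique, top, freq).
all_stats = sample.describe(include="all")
print(numeric_stats)
print(all_stats)
```

Note that `describe()` treats `Account` as a number because it is stored as an integer, even though its mean is not meaningful.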

# [30] Filtering
One of the most common operations on a data set is filtering it down to the relevant records. What counts as relevant depends on the problem we are trying to solve.

In [16]:
df.head()

Unnamed: 0,Account,Name,Rep,Manager,Product,Quantity,Price,Status
0,714466,Trantow-Barrows,Craig Booker,Debra Henley,CPU,1,30000,presented
1,714466,Trantow-Barrows,Craig Booker,Debra Henley,Software,1,10000,presented
2,714466,Trantow-Barrows,Craig Booker,Debra Henley,Maintenance,2,5000,pending
3,737550,"Fritsch, Russel and Anderson",Craig Booker,Debra Henley,CPU,1,35000,declined
4,146832,Kiehn-Spinka,Daniel Hilton,Debra Henley,CPU,2,65000,won


In [5]:
# df.iloc[:8]

In [6]:
# let's add a second dimension

In [7]:
# let's select with column names

In [8]:
# what is type of df?
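A possible sketch of the selection steps hinted at in the cells above: positional slicing with `iloc`, adding a second (column) dimension, selecting by column name, and checking types. The `sample` frame is an assumption built from rows shown earlier.

```python
import pandas as pd

# Small sample copied from the rows shown above.
sample = pd.DataFrame({
    "Name": ["Trantow-Barrows", "Trantow-Barrows", "Trantow-Barrows",
             "Fritsch, Russel and Anderson", "Kiehn-Spinka"],
    "Product": ["CPU", "Software", "Maintenance", "CPU", "CPU"],
    "Price": [30000, 10000, 5000, 35000, 65000],
    "Status": ["presented", "presented", "pending", "declined", "won"],
})

first_three = sample.iloc[:3]                # rows by position (end excluded)
corner = sample.iloc[:3, :2]                 # second dimension: first two columns
by_name = sample.loc[:3, ["Name", "Price"]]  # label-based: end label 3 is included
one_column = sample["Price"]                 # a single name gives a Series
two_columns = sample[["Name", "Price"]]      # a list of names gives a DataFrame

print(type(sample).__name__, type(one_column).__name__)
```

The difference between `iloc[:3]` (three rows) and `loc[:3]` (four rows) is a common source of off-by-one surprises.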

In [21]:
df[df['Status']  == 'won']

Unnamed: 0,Account,Name,Rep,Manager,Product,Quantity,Price,Status
4,146832,Kiehn-Spinka,Daniel Hilton,Debra Henley,CPU,2,65000,won
9,141962,Herman LLC,Cedric Moss,Fred Anderson,CPU,2,65000,won
13,307599,"Kassulke, Ondricka and Metz",Wendy Yule,Fred Anderson,Maintenance,3,7000,won
14,688981,Keeling LLC,Wendy Yule,Fred Anderson,CPU,5,100000,won


In [9]:
# let's deconstruct this
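One way to deconstruct `df[df['Status'] == 'won']` is to name each intermediate step. The sketch below uses a four-row `sample` (an assumption, taken from rows shown above) rather than the full data set.

```python
import pandas as pd

sample = pd.DataFrame({
    "Name": ["Trantow-Barrows", "Fritsch, Russel and Anderson",
             "Kiehn-Spinka", "Herman LLC"],
    "Status": ["presented", "declined", "won", "won"],
})

status_column = sample["Status"]  # step 1: a Series of strings
mask = status_column == "won"     # step 2: element-wise comparison gives booleans
won_rows = sample[mask]           # step 3: keep only rows where the mask is True

print(mask)
print(won_rows)
```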

## Exercise
How many accounts have a price greater than $12,000?

**Let's see how we get the maximum value of a certain column.**
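A minimal sketch, using the `Price` values from the first five rows shown earlier as a stand-alone Series (the name `prices` is an assumption):

```python
import pandas as pd

prices = pd.Series([30000, 10000, 5000, 35000, 65000], name="Price")

highest = prices.max()         # the largest value in the column
highest_at = prices.idxmax()   # index label of the row that holds it

print(highest, highest_at)
```

On the full DataFrame the same call would be written as `df["Price"].max()`. Other Series reductions follow the same pattern.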

## https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html

## Exercise
What is the minimum contract price? What is the mean? And the standard deviation?

**Let's see how we can combine Boolean Masks to filter on multiple criteria.**

In [11]:
status_won = df['Status'] == 'won'

**Let's see what the result of this operation is...**

In [12]:
status_won

0     False
1     False
2     False
3     False
4      True
5     False
6     False
7     False
8     False
9      True
10    False
11    False
12    False
13     True
14     True
15    False
16    False
Name: Status, dtype: bool

In [10]:
# let's do product_is_cpu

In [11]:
# let's combine them
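Boolean masks can be combined with `&` (and), `|` (or), and `~` (not). The sketch below builds `product_is_cpu` alongside `status_won` on a six-row `sample` (an assumption, copied from rows shown above).

```python
import pandas as pd

sample = pd.DataFrame({
    "Product": ["CPU", "Software", "Maintenance", "CPU", "CPU", "Maintenance"],
    "Status": ["presented", "presented", "pending", "declined", "won", "won"],
})

status_won = sample["Status"] == "won"
product_is_cpu = sample["Product"] == "CPU"

won_cpu = sample[status_won & product_is_cpu]     # both conditions hold
won_or_cpu = sample[status_won | product_is_cpu]  # at least one holds
not_won = sample[~status_won]                     # negation

# Written inline, each comparison needs its own parentheses,
# because & binds more tightly than ==.
inline = sample[(sample["Status"] == "won") & (sample["Product"] == "CPU")]
```

Python's `and` / `or` keywords do not work element-wise on a Series, which is why the operators are used here.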

**As an aside, there are many useful series methods, for example `unique`**
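For instance, on the `Status` values of the first five rows (the name `statuses` is an assumption):

```python
import pandas as pd

statuses = pd.Series(
    ["presented", "presented", "pending", "declined", "won"], name="Status"
)

distinct = statuses.unique()      # distinct values, in order of first appearance
n_distinct = statuses.nunique()   # how many distinct values there are
counts = statuses.value_counts()  # how often each value occurs

print(distinct, n_distinct)
print(counts)
```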

## Exercise
What is the total `Amount` (hint: you'll need to create a new column) for contracts that match the following criteria?

```
product either CPU or Software or Maintenance
manager is either Fred or Debra
```

Amount = Quantity * Price

In [12]:
# will show you a terse way
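One possible terse form, not necessarily the one intended in class, uses `Series.isin` to test membership in a list. The `sample` frame below is an assumption: it holds only the eight rows visible in this notebook, so the total differs from the answer on the full data set.

```python
import pandas as pd

sample = pd.DataFrame({
    "Manager": ["Debra Henley"] * 5 + ["Fred Anderson"] * 3,
    "Product": ["CPU", "Software", "Maintenance", "CPU", "CPU",
                "CPU", "Maintenance", "CPU"],
    "Quantity": [1, 1, 2, 1, 2, 2, 3, 5],
    "Price": [30000, 10000, 5000, 35000, 65000, 65000, 7000, 100000],
})

# New column computed element-wise from two existing columns.
sample["Amount"] = sample["Quantity"] * sample["Price"]

wanted_products = sample["Product"].isin(["CPU", "Software", "Maintenance"])
wanted_managers = sample["Manager"].isin(["Debra Henley", "Fred Anderson"])

# .loc takes a row mask and a column label in one step.
total = sample.loc[wanted_products & wanted_managers, "Amount"].sum()
print(total)
```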

# [20] Aggregating Data / Pivot Tables

**Sometimes it's useful to get an aggregate view of our data.**

## https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html

Let's step through the following together!

**Let's pivot on one index.**

**Let's pivot on multiple indices**

**Let's reverse those indices**
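The three index variants above might look like the sketch below. `sample` is an assumption holding the eight rows visible in this notebook. `values` is passed explicitly so that the default aggregation (the mean) is computed only over numeric columns.

```python
import pandas as pd

sample = pd.DataFrame({
    "Manager": ["Debra Henley"] * 5 + ["Fred Anderson"] * 3,
    "Rep": ["Craig Booker"] * 4 + ["Daniel Hilton", "Cedric Moss",
            "Wendy Yule", "Wendy Yule"],
    "Quantity": [1, 1, 2, 1, 2, 2, 3, 5],
    "Price": [30000, 10000, 5000, 35000, 65000, 65000, 7000, 100000],
})
numeric = ["Price", "Quantity"]

# One index: one row per manager, mean of each numeric column.
by_manager = pd.pivot_table(sample, index="Manager", values=numeric)

# Multiple indices: rows grouped by manager, then by rep.
by_manager_rep = pd.pivot_table(sample, index=["Manager", "Rep"], values=numeric)

# Reversed: same groups, but rep is now the outer level.
by_rep_manager = pd.pivot_table(sample, index=["Rep", "Manager"], values=numeric)

print(by_manager_rep)
```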

**Let's specify which values we care about**

**Let's specify which columns we want broken down**

**Let's specify how we want the values to be aggregated (`aggfunc`)**

**Let's customize `aggfunc`**

**Let's fill N/A values**

**Let's get subtotals**
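A hedged sketch of the remaining `pivot_table` parameters named above: `values`, `columns`, `aggfunc`, `fill_value`, and `margins`. As before, `sample` is an assumption built from the eight rows visible in this notebook, so the numbers are not those of the full data set.

```python
import pandas as pd

sample = pd.DataFrame({
    "Manager": ["Debra Henley"] * 5 + ["Fred Anderson"] * 3,
    "Product": ["CPU", "Software", "Maintenance", "CPU", "CPU",
                "CPU", "Maintenance", "CPU"],
    "Quantity": [1, 1, 2, 1, 2, 2, 3, 5],
    "Price": [30000, 10000, 5000, 35000, 65000, 65000, 7000, 100000],
})

# values + aggfunc: sum of Price per manager instead of the default mean.
price_sum = pd.pivot_table(sample, index="Manager", values="Price", aggfunc="sum")

# columns: break the sums down by product (missing combinations become NaN).
by_product = pd.pivot_table(sample, index="Manager", columns="Product",
                            values="Price", aggfunc="sum")

# fill_value: replace those NaN cells with 0.
filled = pd.pivot_table(sample, index="Manager", columns="Product",
                        values="Price", aggfunc="sum", fill_value=0)

# A dict customizes the aggregation per column.
custom = pd.pivot_table(sample, index="Manager", values=["Price", "Quantity"],
                        aggfunc={"Price": ["sum", "mean"], "Quantity": "sum"})

# margins=True adds an "All" row and column holding the totals.
with_totals = pd.pivot_table(sample, index="Manager", columns="Product",
                             values="Price", aggfunc="sum",
                             fill_value=0, margins=True)
print(with_totals)
```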

## Exercise
Create your own pivots that you think will be useful, and then let's share and discuss.

### Mini-Exercise
Get total amount per rep, ONLY considering the contracts that are NOT declined.

# [60] Applying Transformations to the Data
## This is the "fun" part (it can be!) of data wrangling

In [65]:
df.head()

Unnamed: 0,Account,Name,Rep,Manager,Product,Quantity,Price,Status,Amount
0,714466,Trantow-Barrows,Craig Booker,Debra Henley,CPU,1,30000,presented,30000
1,714466,Trantow-Barrows,Craig Booker,Debra Henley,Software,1,10000,presented,10000
2,714466,Trantow-Barrows,Craig Booker,Debra Henley,Maintenance,2,5000,pending,10000
3,737550,"Fritsch, Russel and Anderson",Craig Booker,Debra Henley,CPU,1,35000,declined,35000
4,146832,Kiehn-Spinka,Daniel Hilton,Debra Henley,CPU,2,65000,won,130000


**Let's look at Python dictionaries.**
Let's make note of the syntax.

In [66]:
course_catalog = {
    'AST101': 'Astronomy 101: The Solar System',
    'MATH101': 'Pre-Algebra',
    'ENG304': 'Shakespeare and Donne'
}

In [67]:
course_catalog['AST101']

'Astronomy 101: The Solar System'

In [68]:
course_catalog['AST102']

KeyError: 'AST102'
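When a key might be missing, `dict.get` and the `in` operator avoid the `KeyError`. A short sketch follows; the fallback string is an arbitrary choice for illustration.

```python
course_catalog = {
    'AST101': 'Astronomy 101: The Solar System',
    'MATH101': 'Pre-Algebra',
    'ENG304': 'Shakespeare and Donne'
}

missing = course_catalog.get('AST102')                     # None instead of an error
fallback = course_catalog.get('AST102', 'Unknown course')  # explicit default
is_listed = 'AST102' in course_catalog                     # membership test on keys

print(missing, fallback, is_listed)
```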

**Let's talk about `lambda`**

In [81]:
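# A named function, applied to every value in the column
def double_val(value):
    return value * 2


df["high_quantity"] = df["Quantity"].apply(double_val)

# The same transformation written inline with a lambda
df["high_quantity"] = df["Quantity"].apply(lambda val: val * 2)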

In [82]:
df.head()

Unnamed: 0,Account,Name,Rep,Manager,Product,Quantity,Price,Status,Amount,high_quantity
0,714466,Trantow-Barrows,Craig Booker,Debra Henley,CPU,1,30000,presented,30000,2
1,714466,Trantow-Barrows,Craig Booker,Debra Henley,Software,1,10000,presented,10000,2
2,714466,Trantow-Barrows,Craig Booker,Debra Henley,Maintenance,2,5000,pending,10000,4
3,737550,"Fritsch, Russel and Anderson",Craig Booker,Debra Henley,CPU,1,35000,declined,35000,2
4,146832,Kiehn-Spinka,Daniel Hilton,Debra Henley,CPU,2,65000,won,130000,4


In [83]:
CLIENT_CATEGORY_MAP = {
    'Trantow-Barrows': 'Accounting',
    'Fritsch, Russel and Anderson': 'Legal',
    'Kiehn-Spinka': 'Manufacturing',
    'Kulas Inc': 'Manufacturing',
    'Jerde-Hilpert': 'Accounting',
    'Barton LLC': 'Enterprise',
    'Herman LLC': 'Enterprise',
    'Purdy-Kunde': 'Legal',
    'Stokes LLC': 'Enterprise',
    'Kassulke, Ondricka and Metz': 'Legal',
    'Koepp Ltd': 'Shipping',
    'Keeling LLC': 'Enterprise',
}

### Mini-Exercise
Create a new column based on the above mapping and call it "Client_Category", using `lambda` and `.apply()`

**Now let's work with a different data set and keep practicing data munging!**

In [13]:
df = pd.read_csv(
    "https://raw.githubusercontent.com/suneel0101/lesson-plan/master/crunchbase_monthly_export.csv",
)

**What happened?**

In [15]:
# let's try with encoding='ISO-8859-1'
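The hint suggests the file is not UTF-8, which is the default encoding `pd.read_csv` assumes. The sketch below reproduces that kind of mismatch on a tiny hypothetical file built in memory (the names and contents are made up for illustration and are not from the Crunchbase export).

```python
import io
import pandas as pd

# Hypothetical two-line CSV, encoded as ISO-8859-1 (Latin-1).
raw = "name,city\nCafé Zoé,Montréal\n".encode("ISO-8859-1")

# Decoding these bytes as UTF-8 fails, because a Latin-1 "é" is a single
# byte (0xE9) that is not valid UTF-8 on its own.
try:
    raw.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

# Telling pandas the actual encoding lets it read the text correctly.
companies = pd.read_csv(io.BytesIO(raw), encoding="ISO-8859-1")
print(decoded_as_utf8)
print(companies)
```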

## Exercise
- Get descriptive statistics
- Look at datatypes, any issues?
- Get unique values for each of the columns and see what they are (in particular category_list, market, status, and region)

## Exercise
Write a function that takes a value like those we see in `funding_total_usd` and returns a numeric value. Call it `transform_funding_total`, for example:

```python
def transform_funding_total(value):
    # some logic
    # transformed_value = ...
    return transformed_value
```

So that when it's called, it does the following:

```
>>> transform_funding_total('1,230,200')
1230200
```

In [16]:
# let's see the distribution of funding totals

**Now let's apply it and transform the column in question**
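Applying the function to the column could look like the sketch below. This is one possible implementation, not the reference solution. The input values are hypothetical, and the handling of surrounding spaces and of a `-` placeholder is an assumption about what the real column may contain.

```python
import pandas as pd

def transform_funding_total(value):
    """Turn a string such as '1,230,200' into the integer 1230200.

    Returns NaN when the cleaned text is not a whole number
    (an assumed convention for unparseable values).
    """
    cleaned = str(value).replace(",", "").strip()
    try:
        return int(cleaned)
    except ValueError:
        return float("nan")

# Hypothetical values, not taken from the real file.
funding = pd.DataFrame({"funding_total_usd": ["1,230,200", " 40,000 ", "-"]})

# apply() calls the function once per value and returns a new Series,
# which replaces the original text column.
funding["funding_total_usd"] = funding["funding_total_usd"].apply(
    transform_funding_total
)
print(funding)
```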

## Exercise (hint: Pivots)
- What is the average number of funding rounds for companies in NYC?

## Exercise
What are the top 3 markets with the highest average funding total per company?