# Practical Pandas for Data Wrangling
## Objectives
- read data using Pandas
- explore and analyze the data
- apply transformations to the data

# [10] Basic Mechanics of Pandas
Let's get familiar with the basic mechanics of the Pandas library.

In [1]:
url = "https://s3.amazonaws.com/python-level-2/sales-funnel.csv"

In [2]:
import pandas as pd

In [5]:
df = pd.read_csv(url)

In [6]:
df.head()

,Account,Name,Rep,Manager,Product,Quantity,Price,Status
0,714466,Trantow-Barrows,Craig Booker,Debra Henley,CPU,1,30000,presented
1,714466,Trantow-Barrows,Craig Booker,Debra Henley,Software,1,10000,presented
2,714466,Trantow-Barrows,Craig Booker,Debra Henley,Maintenance,2,5000,pending
3,737550,"Fritsch, Russel and Anderson",Craig Booker,Debra Henley,CPU,1,35000,declined
4,146832,Kiehn-Spinka,Daniel Hilton,Debra Henley,CPU,2,65000,won


**Let's see what columns are available.**

In [7]:
df.columns

Index(['Account', 'Name', 'Rep', 'Manager', 'Product', 'Quantity', 'Price',
       'Status'],
      dtype='object')

**Let's examine the beginning (head) and end (tail) of the data set**

In [10]:
df.tail(3)

,Account,Name,Rep,Manager,Product,Quantity,Price,Status
14,688981,Keeling LLC,Wendy Yule,Fred Anderson,CPU,5,100000,won
15,729833,Koepp Ltd,Wendy Yule,Fred Anderson,CPU,2,65000,declined
16,729833,Koepp Ltd,Wendy Yule,Fred Anderson,Monitor,2,5000,presented


**Let's examine the data types in this data set**

In [11]:
df.dtypes

Account      int64
Name        object
Rep         object
Manager     object
Product     object
Quantity     int64
Price        int64
Status      object
dtype: object

**Let's check out descriptive statistics**

In [12]:
df.describe()

,Account,Quantity,Price
count,17.0,17.0,17.0
mean,462254.235294,1.764706,30705.882353
std,259093.442862,1.032558,28444.605609
min,141962.0,1.0,5000.0
25%,218895.0,1.0,7000.0
50%,412290.0,2.0,30000.0
75%,714466.0,2.0,40000.0
max,740150.0,5.0,100000.0


In [14]:
df.describe([.1, .3, .5, .7, .9])

,Account,Quantity,Price
count,17.0,17.0,17.0
mean,462254.235294,1.764706,30705.882353
std,259093.442862,1.032558,28444.605609
min,141962.0,1.0,5000.0
10%,156782.4,1.0,5000.0
30%,235254.2,1.0,9400.0
50%,412290.0,2.0,30000.0
70%,714466.0,2.0,36000.0
90%,732919.8,2.4,65000.0
max,740150.0,5.0,100000.0


In [15]:
df.describe(include='all')

,Account,Name,Rep,Manager,Product,Quantity,Price,Status
count,17.0,17,17,17,17,17.0,17.0,17
unique,,12,5,2,4,,,4
top,,Trantow-Barrows,Craig Booker,Debra Henley,CPU,,,presented
freq,,3,4,9,9,,,6
mean,462254.235294,,,,,1.764706,30705.882353,
std,259093.442862,,,,,1.032558,28444.605609,
min,141962.0,,,,,1.0,5000.0,
25%,218895.0,,,,,1.0,7000.0,
50%,412290.0,,,,,2.0,30000.0,
75%,714466.0,,,,,2.0,40000.0,
max,740150.0,,,,,5.0,100000.0,


# [30] Filtering
One of the main things we'll do on our data set is filter to the records that are relevant, where the definition of what is relevant depends on the problem we are trying to solve.

In [16]:
df.head()

,Account,Name,Rep,Manager,Product,Quantity,Price,Status
0,714466,Trantow-Barrows,Craig Booker,Debra Henley,CPU,1,30000,presented
1,714466,Trantow-Barrows,Craig Booker,Debra Henley,Software,1,10000,presented
2,714466,Trantow-Barrows,Craig Booker,Debra Henley,Maintenance,2,5000,pending
3,737550,"Fritsch, Russel and Anderson",Craig Booker,Debra Henley,CPU,1,35000,declined
4,146832,Kiehn-Spinka,Daniel Hilton,Debra Henley,CPU,2,65000,won


In [19]:
df.iloc[:8, :3]

,Account,Name,Rep
0,714466,Trantow-Barrows,Craig Booker
1,714466,Trantow-Barrows,Craig Booker
2,714466,Trantow-Barrows,Craig Booker
3,737550,"Fritsch, Russel and Anderson",Craig Booker
4,146832,Kiehn-Spinka,Daniel Hilton
5,218895,Kulas Inc,Daniel Hilton
6,218895,Kulas Inc,Daniel Hilton
7,412290,Jerde-Hilpert,John Smith


In [23]:
df[["Name", "Rep"]].head()

,Name,Rep
0,Trantow-Barrows,Craig Booker
1,Trantow-Barrows,Craig Booker
2,Trantow-Barrows,Craig Booker
3,"Fritsch, Russel and Anderson",Craig Booker
4,Kiehn-Spinka,Daniel Hilton


In [20]:
type(df)

pandas.core.frame.DataFrame

In [21]:
df[df['Status']  == 'won']

,Account,Name,Rep,Manager,Product,Quantity,Price,Status
4,146832,Kiehn-Spinka,Daniel Hilton,Debra Henley,CPU,2,65000,won
9,141962,Herman LLC,Cedric Moss,Fred Anderson,CPU,2,65000,won
13,307599,"Kassulke, Ondricka and Metz",Wendy Yule,Fred Anderson,Maintenance,3,7000,won
14,688981,Keeling LLC,Wendy Yule,Fred Anderson,CPU,5,100000,won


In [27]:
df['Status']  == 'won'

0     False
1     False
2     False
3     False
4      True
5     False
6     False
7     False
8     False
9      True
10    False
11    False
12    False
13     True
14     True
15    False
16    False
Name: Status, dtype: bool

## Exercise
How many accounts have a price greater than $12,000?

In [31]:
df[df['Price'] > 12000]

,Account,Name,Rep,Manager,Product,Quantity,Price,Status
0,714466,Trantow-Barrows,Craig Booker,Debra Henley,CPU,1,30000,presented
3,737550,"Fritsch, Russel and Anderson",Craig Booker,Debra Henley,CPU,1,35000,declined
4,146832,Kiehn-Spinka,Daniel Hilton,Debra Henley,CPU,2,65000,won
5,218895,Kulas Inc,Daniel Hilton,Debra Henley,CPU,2,40000,pending
8,740150,Barton LLC,John Smith,Debra Henley,CPU,1,35000,declined
9,141962,Herman LLC,Cedric Moss,Fred Anderson,CPU,2,65000,won
10,163416,Purdy-Kunde,Cedric Moss,Fred Anderson,CPU,1,30000,presented
14,688981,Keeling LLC,Wendy Yule,Fred Anderson,CPU,5,100000,won
15,729833,Koepp Ltd,Wendy Yule,Fred Anderson,CPU,2,65000,declined
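
The filter above returns the matching rows (nine of them in the output shown). To answer "how many" directly, one option is to count the `True` values in the mask, or to count distinct account numbers. This is a minimal sketch on a few rows copied from the table; the `sample` frame and the variable names are illustrative and not part of the lesson.

```python
import pandas as pd

# A few rows copied from the output above (not the full data set).
sample = pd.DataFrame({
    "Account": [714466, 714466, 737550, 146832],
    "Product": ["CPU", "Software", "CPU", "CPU"],
    "Price": [30000, 10000, 35000, 65000],
})

over_12k = sample["Price"] > 12000

# True counts as 1, so summing the mask counts the matching rows.
n_rows = int(over_12k.sum())

# Number of distinct accounts among the matching rows.
n_accounts = sample.loc[over_12k, "Account"].nunique()

print(n_rows, n_accounts)
```

The two numbers differ whenever one account has several contracts above the threshold, so it is worth deciding which question is being asked.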


In [33]:
df[df.Price > 12000][:3]

,Account,Name,Rep,Manager,Product,Quantity,Price,Status
0,714466,Trantow-Barrows,Craig Booker,Debra Henley,CPU,1,30000,presented
3,737550,"Fritsch, Russel and Anderson",Craig Booker,Debra Henley,CPU,1,35000,declined
4,146832,Kiehn-Spinka,Daniel Hilton,Debra Henley,CPU,2,65000,won


**Let's see how we get the maximum value of a certain column.**

Reference for the available `Series` methods: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html

In [34]:
df['Quantity'].max()

5

## Exercise
What is the minimum contract price? What is the mean? And the standard deviation?

In [35]:
df.Price.min()

5000

In [36]:
df.Price.std()

28444.605608714177

In [37]:
df.Price.mean()

30705.882352941175
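
The three statistics above can also be requested in a single call with `Series.agg`. The sketch below uses only the prices from the first five rows shown earlier, so its numbers will not match the full-data results; the `prices` and `summary` names are illustrative.

```python
import pandas as pd

# Prices from the first five rows of the table (illustrative subset).
prices = pd.Series([30000, 10000, 5000, 35000, 65000], name="Price")

# Passing a list of function names returns one value per statistic.
summary = prices.agg(["min", "mean", "std"])
print(summary)
```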

**Let's see how we can combine Boolean Masks to filter on multiple criteria.**

In [11]:
status_won = df['Status'] == 'won'

**Let's see what the result of this operation is...**

In [12]:
status_won

0     False
1     False
2     False
3     False
4      True
5     False
6     False
7     False
8     False
9      True
10    False
11    False
12    False
13     True
14     True
15    False
16    False
Name: Status, dtype: bool

In [13]:
product_cpu = df['Product'] == 'CPU'

In [15]:
df[status_won & product_cpu]

Unnamed: 0,Account,Name,Rep,Manager,Product,Quantity,Price,Status
4,146832,Kiehn-Spinka,Daniel Hilton,Debra Henley,CPU,2,65000,won
9,141962,Herman LLC,Cedric Moss,Fred Anderson,CPU,2,65000,won
14,688981,Keeling LLC,Wendy Yule,Fred Anderson,CPU,5,100000,won
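
Boolean masks combine element-wise with `&` (and), `|` (or), and `~` (not). The sketch below shows all three on the first five rows of the table; the `sample` frame and result names are illustrative. Note the parentheses in the inline form: they are needed because `&` binds more tightly than `==`.

```python
import pandas as pd

# Product and Status from the first five rows shown earlier.
sample = pd.DataFrame({
    "Product": ["CPU", "Software", "Maintenance", "CPU", "CPU"],
    "Status": ["presented", "presented", "pending", "declined", "won"],
})

status_won = sample["Status"] == "won"
product_cpu = sample["Product"] == "CPU"

both = sample[status_won & product_cpu]          # won AND CPU
either = sample[status_won | product_cpu]        # won OR CPU
cpu_not_won = sample[product_cpu & ~status_won]  # CPU AND NOT won

# Inline comparisons must be wrapped in parentheses.
inline = sample[(sample["Status"] == "won") & (sample["Product"] == "CPU")]

print(len(both), len(either), len(cpu_not_won))
```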


**As an aside, there are many useful series methods, for example `unique`**

In [16]:
df['Product'].unique()

array(['CPU', 'Software', 'Maintenance', 'Monitor'], dtype=object)

## Exercise
What is the total `Amount` (hint: you'll need to create a new column) for contracts that match the following criteria?

```
product either CPU or Software or Maintenance
manager is either Fred or Debra
```

Amount = Quantity * Price

In [38]:
df['Amount'] = df['Quantity'] * df['Price']

In [40]:
is_desired_prod = (df['Product'] == 'CPU') | (df['Product'] == 'Software') | (df['Product'] == 'Maintenance')

In [47]:
is_prod = df.Product.isin(['CPU', 'Software', 'Maintenance'])
is_manager = df.Manager.isin(['Debra Henley', 'Fred Anderson'])
df[is_prod & is_manager]['Amount'].sum()

1176000
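
The solution above builds the product mask twice: once with chained `|` comparisons and once with `isin`. As a quick check that the two forms are interchangeable, here is a small sketch using the product names from the `unique()` output; the `products` series is illustrative.

```python
import pandas as pd

# Product values taken from the unique() output above.
products = pd.Series(["CPU", "Software", "Maintenance", "Monitor", "CPU"])

chained = (products == "CPU") | (products == "Software") | (products == "Maintenance")
with_isin = products.isin(["CPU", "Software", "Maintenance"])

# Both masks select exactly the same rows.
same = chained.equals(with_isin)
print(same)
```

`isin` tends to stay readable as the list of accepted values grows.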

# [20] Aggregating Data / Pivot Tables

**Sometimes it's useful to get an aggregate view of our data.**

Reference for `pivot_table`: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html

Let's step through the following together!

**Let's pivot on one index.**

In [48]:
pd.pivot_table(df, index='Manager')

Manager,Account,Amount,Price,Quantity
Debra Henley,513112.222222,38888.888889,26111.111111,1.444444
Fred Anderson,405039.0,104500.0,35875.0,2.125


**Let's pivot on multiple indices**

In [49]:
pd.pivot_table(df, index=['Rep', 'Manager'])

Rep,Manager,Account,Amount,Price,Quantity
Cedric Moss,Fred Anderson,196016.5,43750.0,27500.0,1.25
Craig Booker,Debra Henley,720237.0,21250.0,20000.0,1.25
Daniel Hilton,Debra Henley,194874.0,73333.333333,38333.333333,1.666667
John Smith,Debra Henley,576220.0,22500.0,20000.0,1.5
Wendy Yule,Fred Anderson,614061.5,165250.0,44250.0,3.0


**Let's reverse those indices**

In [50]:
pd.pivot_table(df, index=['Manager', 'Rep'])

Manager,Rep,Account,Amount,Price,Quantity
Debra Henley,Craig Booker,720237.0,21250.0,20000.0,1.25
Debra Henley,Daniel Hilton,194874.0,73333.333333,38333.333333,1.666667
Debra Henley,John Smith,576220.0,22500.0,20000.0,1.5
Fred Anderson,Cedric Moss,196016.5,43750.0,27500.0,1.25
Fred Anderson,Wendy Yule,614061.5,165250.0,44250.0,3.0


**Let's specify which values we care about**

In [56]:
pd.pivot_table(
    df,
    index=['Manager', 'Rep'],
    values=['Amount'],
    columns=['Status'],
    fill_value=0,
    aggfunc='sum'
)

,,Amount,Amount,Amount,Amount
,Status,declined,pending,presented,won
Manager,Rep,,,,
Debra Henley,Craig Booker,35000,10000,40000,0
Debra Henley,Daniel Hilton,0,80000,10000,130000
Debra Henley,John Smith,35000,10000,0,0
Fred Anderson,Cedric Moss,0,5000,40000,130000
Fred Anderson,Wendy Yule,130000,0,10000,521000


**Let's specify which columns we want broken down**

**Let's specify how we want the values to be aggregated (`aggfunc`)**

In [59]:
pd.pivot_table(
    df,
    index=['Manager', 'Rep'],
    values=['Amount', 'Price'],
    columns=['Status'],
    fill_value=0,
    aggfunc={'Amount': 'sum', 'Price': 'mean'}
)

,,Amount,Amount,Amount,Amount,Price,Price,Price,Price
,Status,declined,pending,presented,won,declined,pending,presented,won
Manager,Rep,,,,,,,,
Debra Henley,Craig Booker,35000,10000,40000,0,35000,5000,20000,0
Debra Henley,Daniel Hilton,0,80000,10000,130000,0,40000,10000,65000
Debra Henley,John Smith,35000,10000,0,0,35000,5000,0,0
Fred Anderson,Cedric Moss,0,5000,40000,130000,0,5000,20000,65000
Fred Anderson,Wendy Yule,130000,0,10000,521000,65000,0,5000,53500


**Let's fill N/A values**

**Let's get subtotals**
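
Both ideas map onto `pivot_table` arguments: `fill_value` replaces the empty cells that appear when a combination has no rows, and `margins=True` appends a row and column of totals (named via `margins_name`). The sketch below uses four contracts taken from the tables above; the `sample` frame is illustrative.

```python
import pandas as pd

# Four contracts copied from earlier output (Amount = Quantity * Price).
sample = pd.DataFrame({
    "Manager": ["Debra Henley", "Debra Henley", "Fred Anderson", "Fred Anderson"],
    "Status": ["presented", "won", "won", "declined"],
    "Amount": [30000, 130000, 130000, 130000],
})

table = pd.pivot_table(
    sample,
    index="Manager",
    columns="Status",
    values="Amount",
    aggfunc="sum",
    fill_value=0,           # combinations with no rows show 0 instead of NaN
    margins=True,           # add subtotal row and column
    margins_name="Totals",  # label for the subtotals
)
print(table)
```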

## Exercise
Create your own pivots that you think will be useful, and then let's share and discuss.

In [61]:
pd.pivot_table(df, 
               index=['Rep', 'Name', 'Product'], 
               values=['Amount', 'Quantity'], 
               columns=['Status'], 
               fill_value=0,
               aggfunc={'Quantity': 'sum', 'Amount': 'sum'},
               margins=True,
               margins_name='Totals'
              )

,,,Amount,Amount,Amount,Amount,Amount,Quantity,Quantity,Quantity,Quantity,Quantity
,,Status,declined,pending,presented,won,Totals,declined,pending,presented,won,Totals
Rep,Name,Product,,,,,,,,,,
Cedric Moss,Herman LLC,CPU,0,0,0,130000,130000,0,0,0,2,2
Cedric Moss,Purdy-Kunde,CPU,0,0,30000,0,30000,0,0,1,0,1
Cedric Moss,Stokes LLC,Maintenance,0,5000,0,0,5000,0,1,0,0,1
Cedric Moss,Stokes LLC,Software,0,0,10000,0,10000,0,0,1,0,1
Craig Booker,"Fritsch, Russel and Anderson",CPU,35000,0,0,0,35000,1,0,0,0,1
Craig Booker,Trantow-Barrows,CPU,0,0,30000,0,30000,0,0,1,0,1
Craig Booker,Trantow-Barrows,Maintenance,0,10000,0,0,10000,0,2,0,0,2
Craig Booker,Trantow-Barrows,Software,0,0,10000,0,10000,0,0,1,0,1
Daniel Hilton,Kiehn-Spinka,CPU,0,0,0,130000,130000,0,0,0,2,2
Daniel Hilton,Kulas Inc,CPU,0,80000,0,0,80000,0,2,0,0,2


In [63]:
pd.pivot_table(
    df,
    index=['Product'],
    values=['Quantity', 'Price'],
    columns=['Rep'],
    fill_value=0,
    aggfunc={'Quantity': 'sum', 'Price': 'mean'}
)

,Price,Price,Price,Price,Price,Quantity,Quantity,Quantity,Quantity,Quantity
Rep,Cedric Moss,Craig Booker,Daniel Hilton,John Smith,Wendy Yule,Cedric Moss,Craig Booker,Daniel Hilton,John Smith,Wendy Yule
Product,,,,,,,,,,
CPU,47500,32500,52500,35000,82500,3,2,4,1,7
Maintenance,5000,5000,0,5000,7000,1,2,0,2,3
Monitor,0,0,0,0,5000,0,0,0,0,2
Software,10000,10000,10000,0,0,1,1,1,0,0


### Mini-Exercise
Get total amount per rep, ONLY considering the contracts that are NOT declined.

In [64]:
pd.pivot_table(df[df.Status != 'declined'], index='Rep', values='Amount', aggfunc='sum')

Rep,Amount
Cedric Moss,175000
Craig Booker,50000
Daniel Hilton,220000
John Smith,10000
Wendy Yule,531000


# [60] Applying Transformations to the Data
## This is the "fun" part (it can be!) of data wrangling

In [65]:
df.head()

,Account,Name,Rep,Manager,Product,Quantity,Price,Status,Amount
0,714466,Trantow-Barrows,Craig Booker,Debra Henley,CPU,1,30000,presented,30000
1,714466,Trantow-Barrows,Craig Booker,Debra Henley,Software,1,10000,presented,10000
2,714466,Trantow-Barrows,Craig Booker,Debra Henley,Maintenance,2,5000,pending,10000
3,737550,"Fritsch, Russel and Anderson",Craig Booker,Debra Henley,CPU,1,35000,declined,35000
4,146832,Kiehn-Spinka,Daniel Hilton,Debra Henley,CPU,2,65000,won,130000


**Let's look at Python dictionaries.**
Let's make note of the syntax.

In [66]:
course_catalog = {
    'AST101': 'Astronomy 101: The Solar System',
    'MATH101': 'Pre-Algebra',
    'ENG304': 'Shakespeare and Donne'
}

In [67]:
course_catalog['AST101']

'Astronomy 101: The Solar System'

In [68]:
course_catalog['AST102']

KeyError: 'AST102'

In [69]:
'ENG304' in course_catalog

True

In [71]:
'Pre-Algebra' in course_catalog.items()

False
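
The last result is `False` because `.items()` holds `(key, value)` tuples, so a bare string never matches one of its elements. The sketch below contrasts the three membership tests and also shows `dict.get`, which returns a default instead of raising `KeyError`; the `"unknown"` default is an arbitrary choice for illustration.

```python
course_catalog = {
    'AST101': 'Astronomy 101: The Solar System',
    'MATH101': 'Pre-Algebra',
    'ENG304': 'Shakespeare and Donne'
}

# `in` on the dict itself tests the keys.
key_found = 'ENG304' in course_catalog

# .values() holds the descriptions, so a description can be found there.
value_found = 'Pre-Algebra' in course_catalog.values()

# .items() holds (key, value) tuples, so the test needs a tuple.
pair_found = ('MATH101', 'Pre-Algebra') in course_catalog.items()

# .get() avoids the KeyError seen above for a missing key.
missing = course_catalog.get('AST102', 'unknown')

print(key_found, value_found, pair_found, missing)
```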

In [72]:
course_catalog.keys()

dict_keys(['AST101', 'MATH101', 'ENG304'])

In [73]:
course_catalog.values()

dict_values(['Astronomy 101: The Solar System', 'Pre-Algebra', 'Shakespeare and Donne'])

In [75]:
course_catalog.items()

dict_items([('AST101', 'Astronomy 101: The Solar System'), ('MATH101', 'Pre-Algebra'), ('ENG304', 'Shakespeare and Donne')])

In [76]:
df.head()

,Account,Name,Rep,Manager,Product,Quantity,Price,Status,Amount
0,714466,Trantow-Barrows,Craig Booker,Debra Henley,CPU,1,30000,presented,30000
1,714466,Trantow-Barrows,Craig Booker,Debra Henley,Software,1,10000,presented,10000
2,714466,Trantow-Barrows,Craig Booker,Debra Henley,Maintenance,2,5000,pending,10000
3,737550,"Fritsch, Russel and Anderson",Craig Booker,Debra Henley,CPU,1,35000,declined,35000
4,146832,Kiehn-Spinka,Daniel Hilton,Debra Henley,CPU,2,65000,won,130000


In [81]:
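def double_val(value):
    return value * 2

# apply() calls the function once per value in the column.
df["high_quantity"] = df["Quantity"].apply(double_val)

# The same transformation written with an anonymous (lambda) function.
df["high_quantity"] = df["Quantity"].apply(lambda val: val * 2)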

In [82]:
df.head()

,Account,Name,Rep,Manager,Product,Quantity,Price,Status,Amount,high_quantity
0,714466,Trantow-Barrows,Craig Booker,Debra Henley,CPU,1,30000,presented,30000,2
1,714466,Trantow-Barrows,Craig Booker,Debra Henley,Software,1,10000,presented,10000,2
2,714466,Trantow-Barrows,Craig Booker,Debra Henley,Maintenance,2,5000,pending,10000,4
3,737550,"Fritsch, Russel and Anderson",Craig Booker,Debra Henley,CPU,1,35000,declined,35000,2
4,146832,Kiehn-Spinka,Daniel Hilton,Debra Henley,CPU,2,65000,won,130000,4


In [83]:
CLIENT_CATEGORY_MAP = {
    'Trantow-Barrows': 'Accounting',
    'Fritsch, Russel and Anderson': 'Legal',
    'Kiehn-Spinka': 'Manufacturing',
    'Kulas Inc': 'Manufacturing',
    'Jerde-Hilpert': 'Accounting',
    'Barton LLC': 'Enterprise',
    'Herman LLC': 'Enterprise',
    'Purdy-Kunde': 'Legal',
    'Stokes LLC': 'Enterprise',
    'Kassulke, Ondricka and Metz': 'Legal',
    'Koepp Ltd': 'Shipping',
    'Keeling LLC': 'Enterprise',
}

**Let's create a new column based on the above mapping and call it "Client_Category".**

In [84]:
df["Client_Category"] = df["Name"].apply(lambda val: CLIENT_CATEGORY_MAP[val])

In [85]:
df.head()

,Account,Name,Rep,Manager,Product,Quantity,Price,Status,Amount,high_quantity,Client_Category
0,714466,Trantow-Barrows,Craig Booker,Debra Henley,CPU,1,30000,presented,30000,2,Accounting
1,714466,Trantow-Barrows,Craig Booker,Debra Henley,Software,1,10000,presented,10000,2,Accounting
2,714466,Trantow-Barrows,Craig Booker,Debra Henley,Maintenance,2,5000,pending,10000,4,Accounting
3,737550,"Fritsch, Russel and Anderson",Craig Booker,Debra Henley,CPU,1,35000,declined,35000,2,Legal
4,146832,Kiehn-Spinka,Daniel Hilton,Debra Henley,CPU,2,65000,won,130000,4,Manufacturing
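
A dictionary lookup inside `apply` raises `KeyError` as soon as a name is missing from the mapping. An alternative worth knowing is `Series.map`, which accepts the dictionary directly and yields `NaN` for unmapped values. The sketch below is illustrative: `"Acme Corp"` is a made-up name that is deliberately absent from the mapping, and `"Unknown"` is an arbitrary fallback label.

```python
import pandas as pd

# Two real names from the mapping above plus one made-up, unmapped name.
names = pd.Series(["Trantow-Barrows", "Kiehn-Spinka", "Acme Corp"])
category_map = {
    "Trantow-Barrows": "Accounting",
    "Kiehn-Spinka": "Manufacturing",
}

# map() looks each value up in the dict; missing keys become NaN.
categories = names.map(category_map)

# Optionally replace the missing categories with a fallback label.
filled = categories.fillna("Unknown")
print(filled.tolist())
```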


**Now let's work with a different data set and keep practicing data munging!**

In [86]:
df = pd.read_csv(
    "https://raw.githubusercontent.com/suneel0101/lesson-plan/master/crunchbase_monthly_export.csv",
)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 15: invalid start byte

**What happened?**

In [87]:
df = pd.read_csv(
    "https://raw.githubusercontent.com/suneel0101/lesson-plan/master/crunchbase_monthly_export.csv",
    encoding='ISO-8859-1'
)
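
The first read failed because `read_csv` assumes UTF-8 by default, and the file contains a byte sequence that is not valid UTF-8. The sketch below reproduces the same kind of failure on a tiny in-memory file and then reads it with an explicit encoding. The sample bytes are made up for illustration and are not taken from the Crunchbase file.

```python
import io
import pandas as pd

# "caf\u00e9" encoded as ISO-8859-1: the accented letter is the single byte 0xE9,
# which does not form a valid UTF-8 sequence here.
raw = "name\ncaf\u00e9\n".encode("ISO-8859-1")

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Telling pandas the encoding lets the same bytes load cleanly.
latin = pd.read_csv(io.BytesIO(raw), encoding="ISO-8859-1")
print(utf8_ok, latin["name"][0])
```

ISO-8859-1 maps every possible byte to some character, so it never raises a decode error. That makes it a convenient fallback, but it does not prove the file was actually written in that encoding, so it is worth spot-checking names with accented characters.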

In [88]:
df.head()

,permalink,name,homepage_url,category_list,market,funding_total_usd,status,country_code,state_code,region,city,funding_rounds,founded_at,founded_month,founded_quarter,founded_year,first_funding_at,last_funding_at,Unnamed: 18
0,/organization/canal-do-credito,Canal do Credito,http://www.canaldocredito.com.br,|Credit|Technology|Services|Finance|,Credit,750000,,BRA,,Rio de Janeiro,Belo Horizonte,1,,,,,1/1/10,1/1/10,
1,/organization/waywire,#waywire,http://www.waywire.com,|Entertainment|Politics|Social Media|News|,Entertainment,1750000,acquired,USA,NY,New York City,New York,1,6/1/12,2012-06,2012-Q2,2012.0,6/30/12,6/30/12,
2,/organization/tv-communications,&TV Communications,http://enjoyandtv.com,|Games|,Games,4000000,operating,USA,CA,Los Angeles,Los Angeles,2,,,,,6/4/10,9/23/10,
3,/organization/rock-your-paper,'Rock' Your Paper,http://www.rockyourpaper.org,|Publishing|Education|,Education,40000,operating,EST,,Tallinn,Tallinn,1,10/26/12,2012-10,2012-Q4,2012.0,8/9/12,8/9/12,
4,/organization/in-touch-network,(In)Touch Network,http://www.InTouchNetwork.com,|Electronics|Guides|Coffee|Restaurants|Music|i...,Apps,1500000,operating,GBR,,London,London,1,4/1/11,2011-04,2011-Q2,2011.0,4/1/11,4/1/11,


In [31]:
# Any issues?
df.dtypes

permalink               object
name                    object
homepage_url            object
category_list           object
 market                 object
 funding_total_usd      object
status                  object
country_code            object
state_code              object
region                  object
city                    object
funding_rounds           int64
founded_at              object
founded_month           object
founded_quarter         object
founded_year           float64
first_funding_at        object
last_funding_at         object
Unnamed: 18            float64
dtype: object

In [90]:
df.columns

Index(['permalink', 'name', 'homepage_url', 'category_list', ' market ',
       ' funding_total_usd ', 'status', 'country_code', 'state_code', 'region',
       'city', 'funding_rounds', 'founded_at', 'founded_month',
       'founded_quarter', 'founded_year', 'first_funding_at',
       'last_funding_at', 'Unnamed: 18'],
      dtype='object')

In [92]:
df.columns  = ['permalink', 'name', 'homepage_url', 'category_list', 'market',
       'funding_total_usd', 'status', 'country_code', 'state_code', 'region',
       'city', 'funding_rounds', 'founded_at', 'founded_month',
       'founded_quarter', 'founded_year', 'first_funding_at',
       'last_funding_at', 'Unnamed: 18']
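
The issue being fixed here is the stray whitespace in `' market '` and `' funding_total_usd '`. Retyping the whole list works, but the same result can be had by stripping whitespace from every column name. A minimal sketch on an empty frame with a few of the same column names (the `frame` variable is illustrative):

```python
import pandas as pd

# An empty frame with the padded column names seen in df.columns above.
frame = pd.DataFrame(columns=["name", " market ", " funding_total_usd ", "status"])

# .str.strip() removes leading and trailing whitespace from each name.
frame.columns = frame.columns.str.strip()
print(list(frame.columns))
```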

## Exercise
- Get descriptive statistics
- Look at datatypes, any issues?
- Get the unique values for each column and see what they are (in particular `category_list`, `market`, `status`, and `region`)

## Exercise
Write a function that takes a value like those in `funding_total_usd` and returns a numeric value. Call it `transform_funding_total`, for example:

```python
def transform_funding_total(value):
    # some logic
    # transformed_value = ...
    return transformed_value
```

So that when it's called, it does the following:

```
>>> transform_funding_total('1,230,200')
1230200
```

In [97]:
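def clean_total(val):
    # Remove the thousands separators, then convert to a number.
    try:
        return float(val.replace(",", ""))
    except ValueError:
        # Values that cannot be parsed become missing (NaN) in the new column.
        return None

df['clean_funding_total'] = df['funding_total_usd'].apply(clean_total)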

In [99]:
df['clean_funding_total'].describe()

count    3.754400e+04
mean     1.800507e+07
std      4.417086e+08
min      3.500000e+01
25%      3.859800e+05
50%      2.000000e+06
75%      1.000000e+07
max      7.879506e+10
Name: clean_funding_total, dtype: float64
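
The same cleaning can be sketched without a Python-level function by using vectorized string methods and `pd.to_numeric` with `errors="coerce"`, which turns unparseable values into `NaN`. The sample values below are made up; in particular, `" - "` is only an assumed example of a non-numeric placeholder.

```python
import pandas as pd

# Made-up values in the style of funding_total_usd.
raw = pd.Series(["1,230,200", " 750,000 ", " - "])

cleaned = pd.to_numeric(
    raw.str.replace(",", "", regex=False).str.strip(),
    errors="coerce",  # anything that is not a number becomes NaN
)
print(cleaned)
```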

**Now let's apply it and transform the column in question**

## Exercise (hint: Pivots)
- What is the average number of funding rounds for companies in NYC?
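
One possible approach is to pivot on `region` with `funding_rounds` as the value and read off the New York City row. The sketch below runs on hypothetical rows: only the column names and the `'New York City'` region value come from the data shown above, and it assumes "NYC" refers to the `region` column rather than `city`.

```python
import pandas as pd

# Hypothetical companies; names and round counts are made up.
companies = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "region": ["New York City", "New York City", "London", "Los Angeles"],
    "funding_rounds": [1, 3, 2, 2],
})

# Mean funding rounds per region.
by_region = pd.pivot_table(
    companies, index="region", values="funding_rounds", aggfunc="mean"
)
nyc_avg = by_region.loc["New York City", "funding_rounds"]

# The same number via a boolean mask, as a cross-check.
nyc_avg_mask = companies.loc[
    companies["region"] == "New York City", "funding_rounds"
].mean()

print(nyc_avg, nyc_avg_mask)
```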

## Exercise
What are the top 3 markets with the highest average funding total per company?

In [102]:
p_df = pd.pivot_table(df, index=['market'], values=['clean_funding_total'], fill_value=0)

In [108]:
p_df.sort_values('clean_funding_total', axis=0, ascending=False)[:3]

market,clean_funding_total
Oil and Gas,824257670.0
Local Commerce,440500000.0
Natural Gas Uses,400000000.0
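
An equivalent way to get the same ranking is `groupby` followed by `nlargest`. The sketch below uses a handful of rows: four market and funding pairs come from the `head()` output above, and the second `Games` row is made up so that the mean differs from a single value.

```python
import pandas as pd

markets = pd.DataFrame({
    "market": ["Games", "Games", "Credit", "Education", "Apps"],
    # The second Games value (2000000.0) is hypothetical.
    "clean_funding_total": [4000000.0, 2000000.0, 750000.0, 40000.0, 1500000.0],
})

# Average funding per market, then keep the three largest averages.
top3 = markets.groupby("market")["clean_funding_total"].mean().nlargest(3)
print(top3)
```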
