# Quiz: Exploratory Data Analysis

Congratulations on completing the Exploratory Data Analysis course! This quiz assesses the analytical thinking and data exploration skills you have learned in the course. The quiz is expected to be taken in the classroom; please contact our teaching team if you missed the chance to take it in class.

# Bukalapak Dataset

## Data Preparation


We will use an **e-commerce product dataset**. The data is stored as a CSV file, `online_bl.csv`, in the `data_input` folder.

The data contains information on products sold on the e-commerce website Bukalapak.com. It has several variables, including:

- `item_link` : link to the product page on the website
- `title` : the name of the product being sold
- `price_original` : product price
- `price_discount` : discounted product price
- `sub_category` : product sub-category
- `time_update` : time the product information was uploaded to the website
- `scale` : product unit size (for example `1.8 kg`)

Please import the `online_bl.csv` dataset from the `data_input` folder and assign it to the `online_bl` variable. Because the dataset has datetime information, pass the `parse_dates` parameter to `read_csv()` so that the `time_update` column is converted to a datetime data type.


In [3]:
import pandas as pd

In [30]:
## Import Library & Read Data

online_bl = pd.read_csv('data_input/online_bl.csv', parse_dates=['time_update'])
online_bl.head()

Unnamed: 0,item_link,title,price_original,price_discount,sub_category,time_update,scale
0,https://www.bukalapak.com/p/kesehatan-2359/pro...,Rinso Molto Deterjen Bubuk 1.8 kg,30000.0,,detergent,2018-10-20 01:32:00,1.8 kg
1,https://www.bukalapak.com/p/rumah-tangga/home-...,Terlaris - DETERGENT RINSO ANTI NODA 1.8 KG 1 ...,49000.0,,detergent,2018-09-20 01:02:00,1.8 kg
2,https://www.bukalapak.com/p/rumah-tangga/home-...,Good Rinso Molto Purple 1.8 Kg,50000.0,,detergent,2018-10-13 10:46:00,1.8 kg
3,https://www.bukalapak.com/p/rumah-tangga/home-...,Order Rinso Molto Purple 1.8 Kg,49000.0,,detergent,2018-09-24 15:17:00,1.8 kg
4,https://www.bukalapak.com/p/rumah-tangga/home-...,Promonya Rinso Molto Purple 1.8 Kg,49000.0,,detergent,2018-09-27 11:16:00,1.8 kg
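
As a quick check on what `parse_dates` does, the sketch below reads a tiny in-memory CSV instead of the real file. The two rows reuse values from the preview above; the `io.StringIO` wrapper and the variable names `sample`, `is_datetime`, and `months` are assumptions made only so the example runs on its own.

```python
import io
import pandas as pd

# Two rows with values taken from the preview above, held in memory
# so the example does not depend on data_input/online_bl.csv
csv_text = """title,price_original,time_update
Rinso Molto Deterjen Bubuk 1.8 kg,30000.0,2018-10-20 01:32:00
Good Rinso Molto Purple 1.8 Kg,50000.0,2018-10-13 10:46:00
"""

sample = pd.read_csv(io.StringIO(csv_text), parse_dates=['time_update'])

# With parse_dates, the column is a datetime type and the .dt accessor works
is_datetime = pd.api.types.is_datetime64_any_dtype(sample['time_update'])
months = sample['time_update'].dt.month.tolist()
print(is_datetime, months)
```

Without `parse_dates`, the same column would be loaded as text and `.dt` would raise an error.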


Based on `online_bl` dataset you will perform data exploration to ensure it is ready for analysis. The first thing you will do is data type checking. 

In [15]:
# your code here
online_bl.dtypes

item_link                 object
title                     object
price_original           float64
price_discount           float64
sub_category            category
time_update       datetime64[ns]
scale                     object
dtype: object

When read from a CSV file, `sub_category` is loaded as a generic text column even though it holds only a few repeated values. Please change it to the more appropriate `category` data type.

In [14]:
# your code here
online_bl['sub_category'] = online_bl['sub_category'].astype('category')

## Analysis

The `online_bl` dataset contains several product categories sold on the e-commerce site. You are asked to analyze the data and answer a number of questions.

### Product Categories

You want to find out which sub-categories (`sub_category`) are being sold and which of them has the most listings on the site. Using the information from the `sub_category` column, please answer the questions below.

1. How many unique sub-categories (`sub_category`) are there in the `online_bl` dataset? Do we have more "detergent" listings or "sugar" listings in our data?

    - [ ] 2, with more "detergent" than "sugar"
    - [ ] 2, with "detergent" and "sugar" having equal listings
    - [x] 3, with more "sugar" than "detergent"
    - [ ] None of above is correct

In [16]:
online_bl['sub_category'].cat.categories

Index(['detergent', 'rice', 'sugar'], dtype='object')

In [22]:
online_bl[online_bl['sub_category'] == 'detergent']

Unnamed: 0,item_link,title,price_original,price_discount,sub_category,time_update,scale
0,https://www.bukalapak.com/p/kesehatan-2359/pro...,Rinso Molto Deterjen Bubuk 1.8 kg,30000.0,,detergent,2018-10-20 01:32:00,1.8 kg
1,https://www.bukalapak.com/p/rumah-tangga/home-...,Terlaris - DETERGENT RINSO ANTI NODA 1.8 KG 1 ...,49000.0,,detergent,2018-09-20 01:02:00,1.8 kg
2,https://www.bukalapak.com/p/rumah-tangga/home-...,Good Rinso Molto Purple 1.8 Kg,50000.0,,detergent,2018-10-13 10:46:00,1.8 kg
3,https://www.bukalapak.com/p/rumah-tangga/home-...,Order Rinso Molto Purple 1.8 Kg,49000.0,,detergent,2018-09-24 15:17:00,1.8 kg
4,https://www.bukalapak.com/p/rumah-tangga/home-...,Promonya Rinso Molto Purple 1.8 Kg,49000.0,,detergent,2018-09-27 11:16:00,1.8 kg
...,...,...,...,...,...,...,...
101,https://www.bukalapak.com/p/rumah-tangga/perle...,Rinso Anti Noda 1.8 kg,27900.0,,detergent,2018-10-10 01:37:00,1.8 kg
102,https://www.bukalapak.com/p/rumah-tangga/perle...,RINSO ANTI NODA 1.8 kg,30500.0,,detergent,2018-10-05 11:53:00,1.8 kg
103,https://www.bukalapak.com/p/rumah-tangga/perle...,Rinso Anti Noda 1.8 kg,28500.0,,detergent,2018-10-20 09:31:00,1.8 kg
104,https://www.bukalapak.com/p/rumah-tangga/perle...,Rinso anti noda 1.8 kg,27000.0,,detergent,2018-10-17 03:00:00,1.8 kg
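
Filtering shows the matching rows but not a direct comparison of counts. A minimal sketch of a more direct route uses `nunique()` and `value_counts()`. The `toy` frame below is hypothetical stand-in data, not the real listings.

```python
import pandas as pd

# Hypothetical stand-in for online_bl; only the column name matches the real data
toy = pd.DataFrame(
    {'sub_category': ['detergent', 'rice', 'sugar', 'sugar', 'rice', 'sugar']}
)
toy['sub_category'] = toy['sub_category'].astype('category')

# Number of distinct sub-categories
n_unique = toy['sub_category'].nunique()

# Number of listings per sub-category, largest first
counts = toy['sub_category'].value_counts()

print(n_unique)
print(counts)
```

On the real data the same two calls would be applied to `online_bl['sub_category']`.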


### Product Scales

Each of the sub-categories above is sold in several sizes based on weight, including detergent. Detergents on the market come in several scale options (1 kg, 1.8 kg, etc.).

2. Which scale has the most **detergent** listings on Bukalapak?

    - [ ] 1 kg
    - [x] 1.8 kg
    - [ ] 5 kg
    - [ ] 800 gr

In [27]:
online_bl.dtypes

item_link                 object
title                     object
price_original           float64
price_discount           float64
sub_category            category
time_update       datetime64[ns]
scale                     object
dtype: object

In [32]:
online_bl['scale'].value_counts()

1 kg      364
5 kg      274
1.8 kg     88
800 gr     18
Name: scale, dtype: int64

In [35]:
# approach 1
pd.pivot_table(
    data=online_bl,
    index='scale',
    columns='sub_category',
    values='item_link',
    aggfunc='count'
)

sub_category,detergent,rice,sugar
scale,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1 kg,,212.0,152.0
1.8 kg,88.0,,
5 kg,,213.0,61.0
800 gr,18.0,,
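
A frequency table like the one above can also be built with `pd.crosstab`, which fills empty combinations with 0 rather than `NaN`. This is a sketch on hypothetical toy rows; the names `toy`, `table`, and `top_detergent_scale` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy rows; column names follow the real dataset
toy = pd.DataFrame({
    'sub_category': ['detergent', 'detergent', 'detergent', 'rice', 'sugar'],
    'scale': ['1.8 kg', '1.8 kg', '800 gr', '5 kg', '1 kg'],
})

# Count listings for every scale / sub-category combination
table = pd.crosstab(index=toy['scale'], columns=toy['sub_category'])

# Scale with the most detergent listings in the toy data
top_detergent_scale = table['detergent'].idxmax()

print(table)
print(top_detergent_scale)
```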


Suddenly, you are in need of detergent. Based on the detergent scale information and the market price, you are interested in buying detergent in the 1.8 kg and 800 gr scales. First, however, you want to know in which month detergent in each of those scales is sold at the lowest average price.

3. Which month has the **lowest average price** (`mean` of `price_original`) for detergent products in the 1.8 kg and 800 gr scales listed for sale on Bukalapak? Is it the same month for both?

    - [ ] Both 1.8 kg and 800 gr detergents had their lowest average price in August
    - [ ] Both 1.8 kg and 800 gr detergents had their lowest average price in October
    - [x] 1.8 kg detergent: lowest in August; 800 gr detergent: lowest in October
    - [ ] 1.8 kg detergent: lowest in August; 800 gr detergent: lowest in July

In [42]:
online_bl['time_update2'] = online_bl['time_update'].dt.month_name()

In [43]:
online_bl['time_update2']

0        October
1      September
2        October
3      September
4      September
         ...    
739      October
740      October
741      October
742      October
743      October
Name: time_update2, Length: 744, dtype: object

In [74]:
pd.pivot_table(
    data=online_bl,
    index=['time_update2','scale'],
    columns='sub_category',
    values='price_original',
    aggfunc='mean'
)

Unnamed: 0_level_0,sub_category,detergent,rice,sugar
time_update2,scale,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
August,1 kg,,73465.0,40700.0
August,1.8 kg,31000.0,,
August,5 kg,,126052.941176,90000.0
August,800 gr,20000.0,,
January,1 kg,,,18500.0
July,1 kg,,76800.0,21028.571429
July,1.8 kg,40000.0,,
July,5 kg,,73250.0,221000.0
July,800 gr,30000.0,,
June,1 kg,,110333.333333,51900.0


In [75]:
# Keep only detergent rows, then compare the average price per month and scale
pd.pivot_table(
    data=online_bl[online_bl['sub_category'] == 'detergent'],
    index='time_update2',
    columns='scale',
    values='price_original',
    aggfunc='mean'
)

---

# Fund Raising Dataset

## Data Preparation

In the second analysis, you will use the **fund raising** dataset obtained by several startup companies in America. Please use `techcrunch.csv` data from `data_input` folder. The dataset contains the following variables:

- `permalink` : company permalink (URL identifier)
- `company` : company name
- `numEmps` : number of employees
- `category` : company category
- `city` : the name of the city where the company is located
- `state` : state code of the company location
- `fundedDate` : funding date
- `raisedAmt` : the amount of funding obtained
- `raisedCurrency` : currency of the funding amount
- `round` : funding round

In [58]:
## Your code here

techcrunch = pd.read_csv('data_input/techcrunch.csv')
techcrunch

Unnamed: 0,permalink,company,numEmps,category,city,state,fundedDate,raisedAmt,raisedCurrency,round
0,lifelock,LifeLock,,web,Tempe,AZ,1-May-07,6850000,USD,b
1,lifelock,LifeLock,,web,Tempe,AZ,1-Oct-06,6000000,USD,a
2,lifelock,LifeLock,,web,Tempe,AZ,1-Jan-08,25000000,USD,c
3,mycityfaces,MyCityFaces,7.0,web,Scottsdale,AZ,1-Jan-08,50000,USD,seed
4,flypaper,Flypaper,,web,Phoenix,AZ,1-Feb-08,3000000,USD,a
...,...,...,...,...,...,...,...,...,...,...
1455,trusera,Trusera,15.0,web,Seattle,WA,1-Jun-07,2000000,USD,angel
1456,alerts-com,Alerts.com,,web,Bellevue,WA,8-Jul-08,1200000,USD,a
1457,myrio,Myrio,75.0,software,Bothell,WA,1-Jan-01,20500000,USD,unattributed
1458,grid-networks,Grid Networks,,web,Seattle,WA,30-Oct-07,9500000,USD,a


In [57]:
techcrunch.dtypes

permalink           object
company             object
numEmps            float64
category          category
city                object
state               object
fundedDate          object
raisedAmt            int64
raisedCurrency    category
round             category
dtype: object

In [55]:
techcrunch.nunique()

permalink         909
company           909
numEmps            72
category            8
city              193
state              33
fundedDate        386
raisedAmt         281
raisedCurrency      3
round               9
dtype: int64

Before exploring the data further, please convert the columns that don't have an appropriate data type in order to reduce memory usage.

In [56]:
## Your code here
cats = ['category','raisedCurrency','round']
techcrunch[cats] = techcrunch[cats].astype('category')
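
The memory saving from the `category` type can be checked with `memory_usage(deep=True)`. The sketch below uses a hypothetical column of repeated round labels rather than the real file, so the exact byte counts will differ from what `techcrunch` would show.

```python
import pandas as pd

# Hypothetical column: a few distinct labels repeated many times
toy = pd.DataFrame({'round': ['a', 'b', 'c', 'seed', 'angel'] * 2000})

# Bytes used when each value is stored as a separate Python string
as_object = toy['round'].astype('object').memory_usage(deep=True)

# Bytes used when values are stored as small integer codes plus a lookup table
as_category = toy['round'].astype('category').memory_usage(deep=True)

print(as_object, as_category)
```

The saving is largest for columns with few unique values relative to the number of rows, which is why `category`, `raisedCurrency`, and `round` are reasonable candidates.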


## Analysis

### Funding each Category

As someone who wants to run a startup, you want to prepare a fairly thorough funding plan so that your company runs well. Therefore, you are interested in finding out which startup `category` gets the highest funding. Since many startups work in the same field, you want a summary of the typical amount of funding (`raisedAmt`) per category. Because the mean is affected by outliers, you will use the median to summarize which startup fields get the highest funding.

Based on the conditions, answer the questions below. 

4. Which `category` raised the highest amount of funding (`raisedAmt`) on average (use the `median`)?
    
    - [ ] `mobile`
    - [ ] `cleantech`
    - [x] `biotech`
    - [ ] `consulting`

In [61]:
## Your code here
pd.pivot_table(
    data=techcrunch,
    index='category',
    values='raisedAmt',
    aggfunc='median'
)


Unnamed: 0_level_0,raisedAmt
category,Unnamed: 1_level_1
biotech,20000000
cleantech,15500000
consulting,7000000
hardware,13700000
mobile,5000000
other,7750000
software,7125000
web,5000000
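
The reason for preferring the median can be illustrated with a small sketch. The toy amounts below are hypothetical and chosen so that one very large round pulls the mean of a category far above its median.

```python
import pandas as pd

# Hypothetical amounts; 'web' contains one extreme outlier
toy = pd.DataFrame({
    'category': ['web', 'web', 'web', 'biotech', 'biotech', 'biotech'],
    'raisedAmt': [1_000_000, 2_000_000, 300_000_000,
                  15_000_000, 20_000_000, 25_000_000],
})

# Mean and median side by side for each category
summary = toy.groupby('category')['raisedAmt'].agg(['mean', 'median'])

top_by_mean = summary['mean'].idxmax()
top_by_median = summary['median'].idxmax()

print(summary)
print(top_by_mean, top_by_median)
```

In this toy data the two statistics disagree on the top category, which is exactly the situation the median is meant to guard against.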


### Funding each Company

As a social media user, you are interested in analyzing one of the social media companies included in the list of startups receiving funding, namely **Friendster**. Friendster raised a different amount in each of its funding rounds.

5. In which period did Friendster raise its highest amount of funding?
   
    - [x] 2008-08
    - [ ] 2002-12
    - [ ] 2006-08
    - [ ] 2012-01

In [65]:
techcrunch[techcrunch['company'] == 'Friendster']

Unnamed: 0,permalink,company,numEmps,category,city,state,fundedDate,raisedAmt,raisedCurrency,round
318,friendster,Friendster,465.0,web,San Francisco,CA,1-Dec-02,2400000,USD,a
319,friendster,Friendster,465.0,web,San Francisco,CA,1-Oct-03,13000000,USD,b
320,friendster,Friendster,465.0,web,San Francisco,CA,1-Aug-06,10000000,USD,c
321,friendster,Friendster,465.0,web,San Francisco,CA,5-Aug-08,20000000,USD,d
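
The answer options are written as year-month periods, while `fundedDate` is stored as text such as `5-Aug-08`. A possible way to bridge the two, sketched below with the four Friendster rows shown above, is to parse the text with `pd.to_datetime` and convert it with `dt.to_period('M')`. The format string `'%d-%b-%y'` is an assumption based on the values visible in the output, and it relies on English month abbreviations.

```python
import pandas as pd

# The four Friendster rounds copied from the output above
friendster = pd.DataFrame({
    'fundedDate': ['1-Dec-02', '1-Oct-03', '1-Aug-06', '5-Aug-08'],
    'raisedAmt': [2400000, 13000000, 10000000, 20000000],
})

# Parse day-month-year text into datetimes, then reduce to year-month periods
friendster['fundedDate'] = pd.to_datetime(friendster['fundedDate'], format='%d-%b-%y')
friendster['period'] = friendster['fundedDate'].dt.to_period('M')

# Total raised per period, and the period with the largest total
per_period = friendster.groupby('period')['raisedAmt'].sum()
best_period = per_period.idxmax()

print(best_period)
```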


In [88]:
city2 = techcrunch[techcrunch['city'] == 'San Francisco']


After looking at several startups that have received funding, you want to find out more about startups that have successfully received funding in your location, **San Francisco**. Create an aggregation that ranks the companies in San Francisco from the highest to the lowest funding.

6.  Among all companies in San Francisco, which of the following are **not** among the top 5 most funded companies (highest **total** `raisedAmt`)?
    
    - [x] `OpenTable`
    - [ ] `Friendster`
    - [ ] `Facebook`
    - [x] `Snapfish`
  

In [146]:
city2.sort_values(by='raisedAmt', ascending=True).tail(20)

Unnamed: 0,permalink,company,numEmps,category,city,state,fundedDate,raisedAmt,raisedCurrency,round
304,mevio,Mevio,,web,San Francisco,CA,1-Sep-06,15000000,USD,b
305,mevio,Mevio,,web,San Francisco,CA,9-Jul-08,15000000,USD,c
206,tagged,Tagged,,web,San Francisco,CA,1-Jul-07,15000000,USD,b
523,playfirst,PlayFirst,,web,San Francisco,CA,1-Dec-07,16500000,USD,c
345,deliveryagent,Delivery Agent,,web,San Francisco,CA,1-May-07,18500000,USD,c
202,hi5,hi5,100.0,web,San Francisco,CA,1-Jul-07,20000000,USD,a
321,friendster,Friendster,465.0,web,San Francisco,CA,5-Aug-08,20000000,USD,d
65,prosper,Prosper,,web,San Francisco,CA,1-Jun-07,20000000,USD,c
162,healthline,Healthline,,web,San Francisco,CA,1-Jul-07,21000000,USD,b
615,coverity,Coverity,,software,San Francisco,CA,1-Feb-08,22000000,USD,a
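
Sorting individual rows ranks single funding rounds, not company totals, so a company with several medium-sized rounds can be missed. A minimal sketch of the total-based ranking is to filter by city, sum `raisedAmt` per company, and sort. The company names and amounts below are hypothetical placeholders, not values from `techcrunch.csv`.

```python
import pandas as pd

# Hypothetical placeholder companies and amounts
toy = pd.DataFrame({
    'company': ['Alpha', 'Alpha', 'Beta', 'Gamma', 'Gamma', 'Delta'],
    'city': ['San Francisco'] * 5 + ['Seattle'],
    'raisedAmt': [5_000_000, 7_000_000, 10_000_000,
                  4_000_000, 4_000_000, 50_000_000],
})

# Keep one city, then add up all rounds of each company
sf = toy[toy['city'] == 'San Francisco']
totals = sf.groupby('company')['raisedAmt'].sum().sort_values(ascending=False)

# Top 2 here; on the real data head(5) would give the top 5
top2 = totals.head(2)
print(top2)
```

Here `Alpha` ranks first by total even though its largest single round is smaller than `Beta`'s only round.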


In [126]:
pd.pivot_table(
    data=techcrunch,
    index='city',
    columns='company',
    values='raisedAmt',
    aggfunc='sum'
).loc[['San Francisco']]

company,23andMe,3Jam,4HomeMedia,5min,750 Industries,A123Systems,Accertify,AccountNow,Acinion,Acquia,...,samfind,seesmic,trueAnthem,tubemogul,uTest,uber,utoopia,vbs tv,x+1,xkoto
city,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
San Francisco,,,,,1000000.0,,,,,,...,,12000000.0,,,,,,,,


In [147]:
pd.crosstab(
    index=techcrunch['city'],
    columns=techcrunch['company'],
    values=techcrunch['raisedAmt'],
    aggfunc='sum')
# .loc[['San Francisco']]

company,23andMe,3Jam,4HomeMedia,5min,750 Industries,A123Systems,Accertify,AccountNow,Acinion,Acquia,...,samfind,seesmic,trueAnthem,tubemogul,uTest,uber,utoopia,vbs tv,x+1,xkoto
city,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Acton,,,,,,,,,21000000.0,,...,,,,,,,,,,
Agoura Hills,,,,,,,,,,,...,,,,,,,,,,
Alameda,,,,,,,,,,,...,,,,,,,,,,
Albuquerque,,,,,,,,,,,...,,,,,,,,,,
Aliso Viejo,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Westport,,,,,,,,,,,...,,,,,,,,,,
Westwood,,,,,,,,,,,...,,,,,,,,,,
Winston-Salem,,,,,,,,,,,...,,,,,,,,,,
Woburn,,,,,,,,,,,...,,,,,,,,,,


In [104]:
trx_ym_format

city,Acton,Agoura Hills,Alameda,Albuquerque,Aliso Viejo,Allentown,American Fork,Andover,Ann Arbor,Arlington,...,Wellesley,West Hollywood,West Palm Beach,Westborough,Westlake Village,Westport,Westwood,Winston-Salem,Woburn,Woodside
company,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
23andMe,,,,,,,,,,,...,,,,,,,,,,
3Jam,,,,,,,,,,,...,,,,,,,,,,
4HomeMedia,,,,,,,,,,,...,,,,,,,,,,
5min,,,,,,,,,,,...,,,,,,,,,,
750 Industries,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
uber,,,,,,,,,,,...,,,,,,,,,,
utoopia,,,,,,,,,,,...,,,,,,,,,,
vbs tv,,,,,,,,,,,...,,,,,,,,,,
x+1,,,,,,,,,,,...,,,,,,,,,,
