# Quiz: Exploratory Data Analysis

Read `online_bl.csv` from the `data_input` folder to answer **questions 1-3**. You may find it helpful to pass `parse_dates=[__]` when calling the `read_csv()` function. The dataset contains all items listed for sale on the popular e-commerce website bukalapak.com within a specific set of categories.

Perform the necessary data preparation steps and use the exploratory data analysis techniques you have learned to answer each question.
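As a minimal sketch of what `parse_dates` changes, the snippet below reads a small in-memory CSV twice, once without and once with the argument. The two sample rows and the names `csv_text`, `without_parsing`, and `with_parsing` are illustrative assumptions; only the `time_update` column name comes from the dataset.

```python
import io

import pandas as pd

# Hypothetical two-row sample that mimics the `time_update` column
csv_text = """title,price_original,time_update
Item A,30000.0,2018-10-20 01:32:00
Item B,49000.0,2018-09-20 01:02:00
"""

# Without parse_dates the timestamps are loaded as plain text
without_parsing = pd.read_csv(io.StringIO(csv_text))

# With parse_dates the column becomes a datetime column,
# which unlocks the `.dt` accessor (month names, periods, and so on)
with_parsing = pd.read_csv(io.StringIO(csv_text), parse_dates=["time_update"])

print(without_parsing["time_update"].dtype)
print(with_parsing["time_update"].dtype)
print(with_parsing["time_update"].dt.month_name().tolist())
```

Parsing at read time avoids a separate `pd.to_datetime()` step later in the analysis.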


In [2]:
## Read data & import library
import pandas as pd

data = pd.read_csv('data_input/online_bl.csv',parse_dates=['time_update'])

In [3]:
data.head()

Unnamed: 0,item_link,title,price_original,price_discount,sub_category,time_update,scale
0,https://www.bukalapak.com/p/kesehatan-2359/pro...,Rinso Molto Deterjen Bubuk 1.8 kg,30000.0,,detergent,2018-10-20 01:32:00,1.8 kg
1,https://www.bukalapak.com/p/rumah-tangga/home-...,Terlaris - DETERGENT RINSO ANTI NODA 1.8 KG 1 ...,49000.0,,detergent,2018-09-20 01:02:00,1.8 kg
2,https://www.bukalapak.com/p/rumah-tangga/home-...,Good Rinso Molto Purple 1.8 Kg,50000.0,,detergent,2018-10-13 10:46:00,1.8 kg
3,https://www.bukalapak.com/p/rumah-tangga/home-...,Order Rinso Molto Purple 1.8 Kg,49000.0,,detergent,2018-09-24 15:17:00,1.8 kg
4,https://www.bukalapak.com/p/rumah-tangga/home-...,Promonya Rinso Molto Purple 1.8 Kg,49000.0,,detergent,2018-09-27 11:16:00,1.8 kg


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 744 entries, 0 to 743
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   item_link       744 non-null    object        
 1   title           744 non-null    object        
 2   price_original  728 non-null    float64       
 3   price_discount  17 non-null     float64       
 4   sub_category    744 non-null    object        
 5   time_update     744 non-null    datetime64[ns]
 6   scale           744 non-null    object        
dtypes: datetime64[ns](1), float64(2), object(4)
memory usage: 40.8+ KB


In [5]:
data[['sub_category','scale']] = data[['sub_category','scale']].astype('category')
data.dtypes

item_link                 object
title                     object
price_original           float64
price_discount           float64
sub_category            category
time_update       datetime64[ns]
scale                   category
dtype: object

1. How many unique sub-categories (`sub_category`) are there in the `online_bl` dataset? Do we have more "detergent" listings or "sugar" listings in the data?

    - [ ] 2, with more "detergent" than "sugar"
    - [ ] 2, with "detergent" and "sugar" having equal listings
    - [x] 3, with more "sugar" than "detergent"
    - [ ] None of the above is correct

In [6]:
data['sub_category'].value_counts()

rice         425
sugar        213
detergent    106
Name: sub_category, dtype: int64
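The two parts of the question can also be answered programmatically. The sketch below rebuilds the `sub_category` column from the counts printed above (it does not read the CSV); the variable names `n_unique`, `counts`, and `more_sugar` are illustrative.

```python
import pandas as pd

# Rebuild the column from the value_counts() output shown above
sub_category = pd.Series(
    ["rice"] * 425 + ["sugar"] * 213 + ["detergent"] * 106,
    name="sub_category",
).astype("category")

# Number of distinct sub-categories
n_unique = sub_category.nunique()

# Compare the two listing counts directly
counts = sub_category.value_counts()
more_sugar = counts["sugar"] > counts["detergent"]

print(n_unique, bool(more_sugar))
```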

2. Detergent is sold in several sizes (1 kg, 1.8 kg, and so on). Which scale has the most **detergent** listings on Bukalapak?

    - [ ] 1 kg
    - [x] 1.8 kg
    - [ ] 5 kg
    - [ ] 800 gr

In [15]:
detergent = data[data['sub_category']=='detergent'].copy()

In [16]:
pd.crosstab(index = detergent['sub_category'],
            columns = detergent['scale'])

scale,1.8 kg,800 gr
sub_category,Unnamed: 1_level_1,Unnamed: 2_level_1
detergent,88,18
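An alternative to the crosstab is `value_counts()` followed by `idxmax()`. The sketch below rebuilds the detergent `scale` column from the counts above. It also shows one thing to watch for: a categorical column keeps categories that no longer occur after filtering. The `"1 kg"` category here is a hypothetical leftover used only to illustrate that behaviour.

```python
import pandas as pd

# Rebuild the detergent scale column from the crosstab output above
scale = pd.Series(["1.8 kg"] * 88 + ["800 gr"] * 18, name="scale")

# Most frequent scale among detergent listings
top_scale = scale.value_counts().idxmax()
print(top_scale)

# Hypothetical: "1 kg" stands in for a category that only occurs
# in other sub-categories and survives the filter as an unused category
scale_cat = scale.astype(pd.CategoricalDtype(["1 kg", "1.8 kg", "800 gr"]))
with_unused = scale_cat.value_counts()  # includes "1 kg" with a count of 0

# Dropping unused categories keeps the summary limited to observed scales
cleaned = scale_cat.cat.remove_unused_categories().value_counts()
print(with_unused)
print(cleaned)
```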


3. Which month has the **lowest average price** (`mean` of `price_original`) for detergent products listed for sale on Bukalapak, looking at the 1.8 kg and 800 gr scales separately? Is it the same month for both?

    - [x] Both 1.8 kg and 800 gr detergents had their lowest average price in August
    - [ ] Both 1.8 kg and 800 gr detergents had their lowest average price in October
    - [ ] 1.8 kg detergents: lowest in August; 800 gr: lowest in October
    - [ ] 1.8 kg detergents: lowest in August; 800 gr: lowest in July

In [17]:
detergent['month'] = detergent['time_update'].dt.month_name()

In [18]:
detergent_price = pd.crosstab(index = detergent['month'],
            columns = detergent['scale'],
            values = detergent['price_original'],
            aggfunc = 'mean').round(2)
detergent_price

scale,1.8 kg,800 gr
month,Unnamed: 1_level_1,Unnamed: 2_level_1
August,31000.0,20000.0
July,40000.0,30000.0
October,41191.84,21945.45
September,42750.0,33475.0


In [19]:
detergent_price.idxmin()

scale
1.8 kg    August
800 gr    August
dtype: object
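Because `month_name()` returns strings, the crosstab rows are sorted alphabetically (August, July, October, September). The sketch below rebuilds the `detergent_price` table from the output above and reindexes it chronologically before calling `idxmin()`. The names `month_order`, `chronological`, `lowest`, and `same_month` are illustrative; the reordering does not change the answer, it only makes the table easier to read.

```python
import pandas as pd

# Rebuild the mean-price table from the crosstab output shown above
detergent_price = pd.DataFrame(
    {
        "1.8 kg": [31000.0, 40000.0, 41191.84, 42750.0],
        "800 gr": [20000.0, 30000.0, 21945.45, 33475.0],
    },
    index=pd.Index(["August", "July", "October", "September"], name="month"),
)
detergent_price.columns.name = "scale"

# Put the months in calendar order instead of alphabetical order
month_order = ["July", "August", "September", "October"]
chronological = detergent_price.reindex(month_order)

# Month with the lowest mean price for each scale
lowest = chronological.idxmin()

# True when both scales hit their minimum in the same month
same_month = lowest.nunique() == 1

print(chronological)
print(lowest)
print(same_month)
```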

---

Read `techcrunch.csv` from the `data_input` folder to answer **questions 4-6**. The dataset stores fundraising rounds and amounts for startup companies in different categories across the US.

In [25]:
## Read the TechCrunch data
tc = pd.read_csv('data_input/techcrunch.csv',parse_dates=['fundedDate'])
tc.head()

Unnamed: 0,permalink,company,numEmps,category,city,state,fundedDate,raisedAmt,raisedCurrency,round
0,lifelock,LifeLock,,web,Tempe,AZ,2007-05-01,6850000,USD,b
1,lifelock,LifeLock,,web,Tempe,AZ,2006-10-01,6000000,USD,a
2,lifelock,LifeLock,,web,Tempe,AZ,2008-01-01,25000000,USD,c
3,mycityfaces,MyCityFaces,7.0,web,Scottsdale,AZ,2008-01-01,50000,USD,seed
4,flypaper,Flypaper,,web,Phoenix,AZ,2008-02-01,3000000,USD,a


In [26]:
tc.dtypes

permalink                 object
company                   object
numEmps                  float64
category                  object
city                      object
state                     object
fundedDate        datetime64[ns]
raisedAmt                  int64
raisedCurrency            object
round                     object
dtype: object

In [27]:
tc[['category','city','state','round']] = tc[['category','city','state','round']].astype('category')
tc.dtypes

permalink                 object
company                   object
numEmps                  float64
category                category
city                    category
state                   category
fundedDate        datetime64[ns]
raisedAmt                  int64
raisedCurrency            object
round                   category
dtype: object

4. Using `techcrunch.csv`, which `category` raised the highest typical funding amount (`raisedAmt`)? Use the `median` as the measure of the average.
    
    - [ ] `mobile`
    - [ ] `cleantech`
    - [x] `biotech`
    - [ ] `consulting`

In [52]:
tc.pivot_table(
    index='category',
    values='raisedAmt',
    aggfunc='median'
).round(2).sort_values(by='raisedAmt',ascending=False).head()

Unnamed: 0_level_0,raisedAmt
category,Unnamed: 1_level_1
biotech,20000000
cleantech,15500000
hardware,13700000
other,7750000
software,7125000
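The question asks for the median, not the mean. One plausible reason is that funding amounts tend to be skewed by a few very large rounds. The sketch below uses a small hypothetical sample (`tc_sample`, not the real TechCrunch data) to show how the two aggregations can rank categories differently, with `groupby` as an equivalent to the `pivot_table` call above.

```python
import pandas as pd

# Hypothetical sample: one very large "web" round skews that category's mean
tc_sample = pd.DataFrame(
    {
        "category": ["web", "web", "web", "biotech", "biotech"],
        "raisedAmt": [1_000_000, 2_000_000, 90_000_000, 20_000_000, 22_000_000],
    }
)

grouped = tc_sample.groupby("category")["raisedAmt"]

# The median is robust to the single outlier, the mean is not
median_by_cat = grouped.median()
mean_by_cat = grouped.mean()

print(median_by_cat.idxmax())
print(mean_by_cat.idxmax())
```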


5. In which period (year and month) did Friendster raise its highest amount of funding?
   
    - [x] 2008-08
    - [ ] 2002-12
    - [ ] 2006-08
    - [ ] 2012-01

In [28]:
friendster = tc[tc['company']=='Friendster'].copy()
friendster['yearmonth'] = friendster['fundedDate'].dt.to_period('M')

In [29]:
friendster

Unnamed: 0,permalink,company,numEmps,category,city,state,fundedDate,raisedAmt,raisedCurrency,round,yearmonth
318,friendster,Friendster,465.0,web,San Francisco,CA,2002-12-01,2400000,USD,a,2002-12
319,friendster,Friendster,465.0,web,San Francisco,CA,2003-10-01,13000000,USD,b,2003-10
320,friendster,Friendster,465.0,web,San Francisco,CA,2006-08-01,10000000,USD,c,2006-08
321,friendster,Friendster,465.0,web,San Francisco,CA,2008-08-05,20000000,USD,d,2008-08


In [30]:
friendster[['yearmonth','raisedAmt']][friendster['raisedAmt'] == friendster['raisedAmt'].max()]

Unnamed: 0,yearmonth,raisedAmt
321,2008-08,20000000
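The boolean mask above can also be written with `idxmax()`, which returns the index label of the largest value. The sketch below rebuilds the four Friendster rows from the output above, keeping only the columns it needs; `top_row` is an illustrative name. Note that `idxmax()` returns only the first match, so the mask version is the safer choice if ties are possible.

```python
import pandas as pd

# Rebuild the Friendster rows from the output shown above
friendster = pd.DataFrame(
    {
        "fundedDate": pd.to_datetime(
            ["2002-12-01", "2003-10-01", "2006-08-01", "2008-08-05"]
        ),
        "raisedAmt": [2_400_000, 13_000_000, 10_000_000, 20_000_000],
        "round": ["a", "b", "c", "d"],
    },
    index=[318, 319, 320, 321],
)

# Collapse each date to its year-month period
friendster["yearmonth"] = friendster["fundedDate"].dt.to_period("M")

# Row with the largest single funding round
top_row = friendster.loc[friendster["raisedAmt"].idxmax()]

print(top_row["yearmonth"], top_row["raisedAmt"])
```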


6. Among all companies in San Francisco, which of the following is **not** among the top 5 most funded companies (those with the highest **total** `raisedAmt`)?
    
    - [ ] `OpenTable`
    - [ ] `Friendster`
    - [x] `Facebook`
    - [ ] `Snapfish`
  

In [31]:
sf = tc[tc['city']=='San Francisco'].copy()
sf = sf[['company','raisedAmt']]

Using `groupby`:

In [32]:
sf_total_fund_per_company = sf.groupby('company').sum()

In [35]:
sf_total_fund_per_company.sort_values(by=['raisedAmt'], ascending=False).head()

Unnamed: 0_level_0,raisedAmt
company,Unnamed: 1_level_1
Slide,58000000
freebase,57500000
OpenTable,48000000
Friendster,45400000
Snapfish,43500000


Using `pivot_table`:

In [None]:
sf = tc[tc['city']=='San Francisco'].copy()
sf = sf[['company','raisedAmt']]

In [63]:
sf.pivot_table(
    index='company',
    values='raisedAmt',
    aggfunc='sum'
).round(2).sort_values('raisedAmt',ascending=False).head()

Unnamed: 0_level_0,raisedAmt
company,Unnamed: 1_level_1
Slide,58000000
freebase,57500000
OpenTable,48000000
Friendster,45400000
Snapfish,43500000


In [46]:
tc[tc['company']=='Facebook']['city'] # Facebook is in Palo Alto, not San Francisco

11    Palo Alto
12    Palo Alto
13    Palo Alto
14    Palo Alto
15    Palo Alto
16    Palo Alto
17    Palo Alto
Name: city, dtype: category
Categories (193, object): ['Acton', 'Agoura Hills', 'Alameda', 'Albuquerque', ..., 'Westwood', 'Winston-Salem', 'Woburn', 'Woodside']
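To check the answer options against the ranking instead of reading it by eye, the options can be tested for membership in the top 5. The sketch below rebuilds the San Francisco totals from the output above; `sf_totals`, `options`, and `not_in_top5` are illustrative names.

```python
import pandas as pd

# Top San Francisco totals, copied from the output shown above
sf_totals = pd.Series(
    {
        "Slide": 58_000_000,
        "freebase": 57_500_000,
        "OpenTable": 48_000_000,
        "Friendster": 45_400_000,
        "Snapfish": 43_500_000,
    },
    name="raisedAmt",
)

# The four answer options from the question
options = ["OpenTable", "Friendster", "Facebook", "Snapfish"]

# Company names of the five largest totals
top5 = sf_totals.nlargest(5).index

# Options that do not appear in the top 5
not_in_top5 = [company for company in options if company not in top5]

print(not_in_top5)
```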