# Quiz: Exploratory Data Analysis

Congratulations on completing the Exploratory Data Analysis course! This assessment quiz tests the analytical thinking and data exploration skills you learned in the course. The quiz is meant to be taken in the classroom; please contact our teaching team if you missed the chance to take it in class.

# Bukalapak Dataset

## Data Preparation


We will use an **e-commerce product dataset**. The data is stored as a CSV file, `online_bl.csv`, in the `data_input` folder.

The data contains information on products sold on the e-commerce website Bukalapak.com. It has several variables, including:

- `item_link` : link to the product's page on the website
- `title` : the name of the product being sold
- `price_original` : original product price
- `price_discount` : discounted product price
- `sub_category` : product sub-category
- `time_update` : time the product information was uploaded to the website
- `scale` : product unit scale (for example `1.8 kg`)

Please import the `online_bl.csv` dataset from the `data_input` folder and assign it to the `online_bl` variable. Because the dataset contains datetime information, use the `parse_dates` parameter of `read_csv()` to convert the `time_update` column to a datetime data type.
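
To see what `parse_dates` changes, here is a minimal sketch on a small in-memory CSV. The two rows and the names `csv_text`, `plain` and `parsed` are made up for illustration and are not taken from `online_bl.csv`.

```python
import io

import pandas as pd

# Two hypothetical rows that mimic a few columns of online_bl.csv.
csv_text = (
    "title,price_original,time_update\n"
    "Sample detergent 1.8 kg,30000.0,2018-10-20 01:32:00\n"
    "Sample sugar 1 kg,12500.0,2018-09-20 01:02:00\n"
)

# Without parse_dates the timestamp column is read as plain text.
plain = pd.read_csv(io.StringIO(csv_text))

# With parse_dates the same column is converted to a datetime dtype.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["time_update"])

print(plain["time_update"].dtype)
print(parsed["time_update"].dtype)
```

Only a datetime column exposes the `.dt` accessor, which the later questions rely on when grouping by month.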


In [2]:
## Import Library & Read Data
import pandas as pd
import numpy as np
print(pd.__version__)

# pandas output display setup: show floats with thousands separators
pd.options.display.float_format = '{:,}'.format
online_bl = pd.read_csv("data_input/online_bl.csv", parse_dates=["time_update"])
online_bl.head()


2.2.2


| | item_link | title | price_original | price_discount | sub_category | time_update | scale |
|---|---|---|---|---|---|---|---|
| 0 | https://www.bukalapak.com/p/kesehatan-2359/pro... | Rinso Molto Deterjen Bubuk 1.8 kg | 30000.0 | NaN | detergent | 2018-10-20 01:32:00 | 1.8 kg |
| 1 | https://www.bukalapak.com/p/rumah-tangga/home-... | Terlaris - DETERGENT RINSO ANTI NODA 1.8 KG 1 ... | 49000.0 | NaN | detergent | 2018-09-20 01:02:00 | 1.8 kg |
| 2 | https://www.bukalapak.com/p/rumah-tangga/home-... | Good Rinso Molto Purple 1.8 Kg | 50000.0 | NaN | detergent | 2018-10-13 10:46:00 | 1.8 kg |
| 3 | https://www.bukalapak.com/p/rumah-tangga/home-... | Order Rinso Molto Purple 1.8 Kg | 49000.0 | NaN | detergent | 2018-09-24 15:17:00 | 1.8 kg |
| 4 | https://www.bukalapak.com/p/rumah-tangga/home-... | Promonya Rinso Molto Purple 1.8 Kg | 49000.0 | NaN | detergent | 2018-09-27 11:16:00 | 1.8 kg |


Using the `online_bl` dataset, you will explore the data to make sure it is ready for analysis. The first step is to check the data types.

In [3]:
# your code here
online_bl.dtypes

item_link                 object
title                     object
price_original           float64
price_discount           float64
sub_category              object
time_update       datetime64[ns]
scale                     object
dtype: object

The `sub_category` column is stored as `object`, which is not the appropriate data type for it. Please change it to the appropriate data type.

In [19]:
# your code here
online_bl["sub_category"] = online_bl["sub_category"].astype("category")
# Scratch answers for questions 1-3 (uncomment to run).
# Q3 needs monthly periods first:
# online_bl["time_update"] = online_bl["time_update"].dt.to_period("M")
# Q1: online_bl["sub_category"].value_counts()
# Q2: online_bl.value_counts(["sub_category", "scale"])
# Q3: online_bl.pivot_table(index=["scale", "time_update"], columns="sub_category", values="price_original", aggfunc="mean")

In [86]:
tc = pd.read_csv("data_input/techcrunch.csv", parse_dates=["fundedDate"])
# tc.head()
tc["category"] = tc["category"].astype("category")
tc["category"].nunique()  # call the method; without () it only references it
# Q4: tc.pivot_table(index="category", values="raisedAmt", aggfunc="median")
tc.dtypes
tc["fundedDate"] = tc["fundedDate"].dt.to_period("M")
tc.head()
# Q5: tc.pivot_table(index="company", columns="fundedDate", values="raisedAmt", aggfunc="sum").query("company == 'Friendster'").dropna(axis=1)
# Q6: tc[tc["city"] == "San Francisco"].pivot_table(index="company", values="raisedAmt", aggfunc="sum").sort_values("raisedAmt", ascending=False)



| company | 2002-12 | 2003-10 | 2006-08 | 2008-08 |
|---|---|---|---|---|
| Friendster | 2400000.0 | 13000000.0 | 10000000.0 | 20000000.0 |


## Analysis

The `online_bl` dataset contains several product categories sold on the e-commerce site. You are asked to analyze the data and answer a number of questions.

### Product Categories

You want to find out which sub-categories (`sub_category`) are being sold and which of them has the most listings on the site. Using the information in the `sub_category` column, please answer the question below.

1. How many unique sub-categories (`sub_category`) are there in the `online_bl` dataset? Are there more "detergent" listings or more "sugar" listings in the data?

    - [ ] 2, with more "detergent" than "sugar"
    - [ ] 2, with "detergent" and "sugar" having equal listings
    - [ ] 3, with more "sugar" than "detergent"
    - [ ] None of the above is correct
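
Counting distinct labels and comparing their frequencies can be sketched as follows. The `toy` frame and its rows are hypothetical; the actual answer has to come from `online_bl`.

```python
import pandas as pd

# Hypothetical listings; the label "other" is invented for illustration.
toy = pd.DataFrame(
    {"sub_category": ["detergent", "sugar", "detergent", "other", "sugar", "detergent"]}
)

# Number of distinct labels in the column.
n_unique = toy["sub_category"].nunique()

# Listings per label, sorted from most to least frequent.
counts = toy["sub_category"].value_counts()

print(n_unique)
print(counts)
```

`value_counts()` alone is enough to read off both pieces of information, since its length equals the number of unique labels.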

### Product Scales

Each item in the sub-categories above is sold in several sizes based on its weight. Detergent, for example, comes in several scale options (1 kg, 1.8 kg, etc.).

2. Which scale (package size) has the most **detergent** listings on Bukalapak?

    - [ ] 1 kg
    - [ ] 1.8 kg
    - [ ] 5 kg
    - [ ] 800 gr
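
One possible way to count listings per scale within a sub-category is a cross-tabulation. This is a sketch on invented rows, and `pd.crosstab` is only one option; `value_counts` on two columns would work as well.

```python
import pandas as pd

# Hypothetical listings, not taken from online_bl.
toy = pd.DataFrame(
    {
        "sub_category": ["detergent", "detergent", "detergent", "sugar", "sugar"],
        "scale": ["1 kg", "800 gr", "1 kg", "1 kg", "1 kg"],
    }
)

# Rows are sub-categories, columns are scales, cells are listing counts.
table = pd.crosstab(toy["sub_category"], toy["scale"])

# Scale with the most listings in the detergent row.
top_scale = table.loc["detergent"].idxmax()

print(table)
print(top_scale)
```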

Suddenly, you are in need of detergent. Based on the detergent scale information and the market price, you are interested in buying detergent in the 1.8 kg and 800 gr scales. Before buying, you want to know in which month detergent in those scales is listed at the lowest average price.

3. In which month is the **average price** (`mean` of `price_original`) lowest for detergent listings of 1.8 kg and of 800 gr on Bukalapak? Is it the same month for both scales?

    - [ ] Both 1.8 kg and 800 gr detergents had their lowest average price in August
    - [ ] Both 1.8 kg and 800 gr detergents had their lowest average price in October
    - [ ] 1.8 kg detergents: lowest in August, 800 gr: lowest in October
    - [ ] 1.8 kg detergents: lowest in August, 800 gr: lowest in July
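
Finding the month with the lowest mean price per scale can be sketched as below. All prices and dates are invented, and the helper column `month` is an assumption of this example rather than a column of `online_bl`.

```python
import pandas as pd

# Hypothetical detergent listings.
toy = pd.DataFrame(
    {
        "scale": ["1.8 kg", "1.8 kg", "1.8 kg", "800 gr", "800 gr", "800 gr"],
        "time_update": pd.to_datetime(
            ["2018-08-05", "2018-09-10", "2018-09-20",
             "2018-08-07", "2018-08-21", "2018-09-15"]
        ),
        "price_original": [40000.0, 30000.0, 32000.0, 15000.0, 17000.0, 20000.0],
    }
)

# Collapse each timestamp to its calendar month, e.g. 2018-08.
toy["month"] = toy["time_update"].dt.to_period("M")

# Mean price per month (rows) and scale (columns).
monthly_mean = toy.pivot_table(
    index="month", columns="scale", values="price_original", aggfunc="mean"
)

# For each scale, the month whose mean price is lowest.
cheapest_month = monthly_mean.idxmin()

print(monthly_mean)
print(cheapest_month)
```

In this toy data the two scales reach their minimum in different months, which is the situation the question asks you to check for.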

---

# Fund Raising Dataset

## Data Preparation

In the second analysis, you will use a dataset on the **fund raising** of several startup companies in America. Please use the `techcrunch.csv` data from the `data_input` folder. The dataset contains the following variables:

- `permalink` : company permalink name
- `company` : company name
- `numEmps` : number of employees
- `category` : company category
- `city` : the name of the city where the company is located
- `state` : state code of company location
- `fundedDate` : funding date
- `raisedAmt` : the amount of funding obtained
- `raisedCurrency` : currency of the amount raised

In [2]:
## Your code here



Before exploring the data further, please convert the columns that do not have an appropriate data type, in order to reduce memory usage.

In [1]:
## Your code here
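
The memory effect of the `category` data type can be checked with `memory_usage(deep=True)`. The sketch below uses an invented column of repeated labels, not the real `techcrunch.csv` data.

```python
import pandas as pd

# Hypothetical column: three labels repeated many times.
labels = pd.Series(["web", "mobile", "software"] * 1000, name="category")

as_object = labels.astype("object")
as_category = labels.astype("category")

# deep=True also counts the Python string objects behind an object column.
bytes_object = as_object.memory_usage(deep=True)
bytes_category = as_category.memory_usage(deep=True)

print(bytes_object, bytes_category)
```

The saving comes from storing each distinct label once plus a small integer code per row, so it is largest for columns with few unique values relative to their length.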


## Analysis

### Funding each Category

As someone who wants to run a startup, you want a fairly thorough funding plan so that your company runs well. You are therefore interested in finding out which startup `category` gets the highest funding. Since many startups work in the same field, you want a summary of the typical amount of funding (`raisedAmt`) per category. Because the mean is affected by outliers, you will use the median to summarize which startup fields get the highest funding.

Based on the conditions, answer the questions below. 

4. Which `category` has the highest typical funding amount (`raisedAmt`), measured by the `median`?
    
    - [ ] `mobile`
    - [ ] `cleantech`
    - [ ] `biotech`
    - [ ] `consulting`
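
The effect of an outlier on the mean versus the median can be illustrated with a small sketch. The categories `a` and `b` and all amounts are hypothetical.

```python
import pandas as pd

# Category "a" contains one very large round that acts as an outlier.
toy = pd.DataFrame(
    {
        "category": ["a", "a", "a", "b", "b", "b"],
        "raisedAmt": [1_000_000, 2_000_000, 90_000_000,
                      5_000_000, 6_000_000, 7_000_000],
    }
)

# Mean and median of raisedAmt per category, side by side.
summary = toy.groupby("category")["raisedAmt"].agg(["mean", "median"])

top_by_mean = summary["mean"].idxmax()
top_by_median = summary["median"].idxmax()

print(summary)
print(top_by_mean, top_by_median)
```

Here the mean ranks `a` first because of the single large round, while the median ranks `b` first, which is why the question asks for the median.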

In [2]:
## Your code here



### Funding each Company

As a social media user, you are interested in analyzing one of the social media companies in the list of funded startups, namely **Friendster**. Friendster raised a different amount of funding in each funding period.

5. In which period did Friendster raise its highest amount of funding?
   
    - [ ] 2008-08
    - [ ] 2002-12
    - [ ] 2006-08
    - [ ] 2012-01
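
Totalling one company's funding per monthly period can be sketched as follows. The companies `ExampleCo` and `OtherCo`, the helper column `period` and all amounts are invented for illustration.

```python
import pandas as pd

# Hypothetical funding rounds.
toy = pd.DataFrame(
    {
        "company": ["ExampleCo", "ExampleCo", "ExampleCo", "OtherCo"],
        "fundedDate": pd.to_datetime(
            ["2005-01-15", "2005-01-28", "2006-06-01", "2005-01-10"]
        ),
        "raisedAmt": [1_000_000, 2_000_000, 2_500_000, 9_000_000],
    }
)

# Collapse each funding date to its calendar month.
toy["period"] = toy["fundedDate"].dt.to_period("M")

# Total raised per period for a single company.
per_period = (
    toy[toy["company"] == "ExampleCo"]
    .groupby("period")["raisedAmt"]
    .sum()
)

# Period with the highest total.
best_period = per_period.idxmax()

print(per_period)
print(best_period)
```

Summing per period matters when a company has more than one round in the same month, as `ExampleCo` does in January 2005 here.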

After looking at several startups that have received funding, you want to find out more about startups that have successfully received funding in your location, **San Francisco**. Create an aggregation of the data that ranks the companies in San Francisco from highest to lowest total funding.

6. Among all companies in San Francisco, which of the following is **not** among the 5 companies with the highest **total** funding (`raisedAmt`)?
    
    - [ ] `OpenTable`
    - [ ] `Friendster`
    - [ ] `Facebook`
    - [ ] `Snapfish`
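
Ranking companies in one city by total funding and checking which candidates fall outside the top of the ranking can be sketched as below. The company names `A` to `D` and the amounts are hypothetical, and the sketch keeps a top 2 only to stay small.

```python
import pandas as pd

# Hypothetical funding rounds; company D is outside San Francisco.
toy = pd.DataFrame(
    {
        "company": ["A", "A", "B", "C", "C", "D"],
        "city": ["San Francisco"] * 5 + ["New York"],
        "raisedAmt": [5, 5, 8, 1, 2, 100],
    }
)

# Total raised per company within the city, highest first.
totals = (
    toy[toy["city"] == "San Francisco"]
    .groupby("company")["raisedAmt"]
    .sum()
    .sort_values(ascending=False)
)

top_n = totals.head(2)  # use head(5) for a top 5

# Candidates that do not appear in the top of the ranking.
not_in_top = [name for name in ["A", "B", "C"] if name not in top_n.index]

print(totals)
print(not_in_top)
```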
  