# Quiz: Python for Data Analysts

In [1]:
import pandas as pd

## Data Pre-Processing

The data you will read in is `companies.csv`, a small sample of a larger CRM (customer relationship management) dataset.

In [2]:
clients = pd.read_csv("data/companies.csv", index_col=0)
clients

ID,Customer Name,Consulting Sales,Software Sales,Forecasted Growth,Returns,Month,Day,Year,Location,Account
30940,New Media Group,IDR7125000,IDR5500000,30.00%,"IDR1,500,000",1,10,2017,Jakarta,Enterprise
82391,Li and Partners,IDR420000,IDR820000,10.00%,"IDR400,000",6,15,2016,Jakarta,Startup
18374,PT. Kreasi Metrik Solusi,0,IDR550403,25.00%,0,3,29,2012,Surabaya,Enterprise
57531,PT. Algoritma Data Indonesia,IDR850000,IDR395500,4.00%,0,7,17,2017,Jakarta,Startup
19002,Palembang Konsultansi,IDR2115000,0,-15.00%,0,2,24,2018,Bandung,Startup
31142,PT. Surya Citra Manajemen,IDR960000,IDR503000,19.00%,0,1,19,2019,Jakarta,Enterprise


Unlike our previous datasets, `clients` has some formatting inconsistencies by design: the `Returns` column uses a comma as the thousands separator alongside the currency prefix (`IDR`), whereas the related `Consulting Sales` and `Software Sales` columns carry the prefix but omit the separator.

Now let's observe its data types:

In [3]:
clients.dtypes

Customer Name        object
Consulting Sales     object
Software Sales       object
Forecasted Growth    object
Returns              object
Month                 int64
Day                   int64
Year                  int64
Location             object
Account              object
dtype: object

Do you think the columns are stored as the right types? Can you apply what you have learnt about specifying data types to this new data?

In [4]:
# ## Your Code Below:
# clients[['Consulting Sales','Software Sales','Returns']] = clients[['Consulting Sales','Software Sales','Returns']].astype('int64')

If you try to call `.astype` directly on `Consulting Sales` and `Software Sales`, you will most likely get an error, because values such as `IDR7125000` cannot be parsed as integers. To perform arithmetic on these columns, we have to drop the `IDR` currency prefix and treat the values as numbers. We'll use pandas' built-in `.str.replace()` method for this.

In [5]:
clients['Consulting Sales'].str.replace('IDR','')

ID
30940    7125000
82391     420000
18374          0
57531     850000
19002    2115000
31142     960000
Name: Consulting Sales, dtype: object
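The difference between converting before and after stripping the prefix can be sketched on a small stand-in Series. The `sales` values below are a hypothetical sample that mimics `Consulting Sales`; they are not read from `companies.csv`.

```python
import pandas as pd

# Hypothetical stand-in for the 'Consulting Sales' column (not read from the CSV)
sales = pd.Series(["IDR7125000", "IDR420000", "0"], name="Consulting Sales")

# Direct conversion fails: "IDR7125000" is not a valid integer literal
try:
    sales.astype("int64")
    direct_conversion_failed = False
except ValueError:
    direct_conversion_failed = True

# Stripping the prefix first lets the conversion succeed
cleaned = sales.str.replace("IDR", "", regex=False).astype("int64")

print(direct_conversion_failed)
print(cleaned.tolist())
```

Passing `regex=False` is a choice made for this sketch: it treats `'IDR'` as a literal string, which keeps the behavior the same regardless of how a given pandas version defaults the `regex` argument.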

To apply the same replacement to multiple columns, we can use the `apply` method with a `lambda`, as below:

In [6]:
clients[['Consulting Sales','Software Sales','Returns']] =\
clients[['Consulting Sales','Software Sales','Returns']].apply(lambda x: x.str.replace('IDR',''))

In [7]:
clients.head()

ID,Customer Name,Consulting Sales,Software Sales,Forecasted Growth,Returns,Month,Day,Year,Location,Account
30940,New Media Group,7125000,5500000,30.00%,"1,500,000",1,10,2017,Jakarta,Enterprise
82391,Li and Partners,420000,820000,10.00%,"400,000",6,15,2016,Jakarta,Startup
18374,PT. Kreasi Metrik Solusi,0,550403,25.00%,0,3,29,2012,Surabaya,Enterprise
57531,PT. Algoritma Data Indonesia,850000,395500,4.00%,0,7,17,2017,Jakarta,Startup
19002,Palembang Konsultansi,2115000,0,-15.00%,0,2,24,2018,Bandung,Startup


Go on and fill in the blank below to remove the comma (`,`) from `Returns`!

In [8]:
## Fill in the blank (___):

clients['Returns'] = clients['Returns'].str.replace(',','')

In [9]:
clients.head()

ID,Customer Name,Consulting Sales,Software Sales,Forecasted Growth,Returns,Month,Day,Year,Location,Account
30940,New Media Group,7125000,5500000,30.00%,1500000,1,10,2017,Jakarta,Enterprise
82391,Li and Partners,420000,820000,10.00%,400000,6,15,2016,Jakarta,Startup
18374,PT. Kreasi Metrik Solusi,0,550403,25.00%,0,3,29,2012,Surabaya,Enterprise
57531,PT. Algoritma Data Indonesia,850000,395500,4.00%,0,7,17,2017,Jakarta,Startup
19002,Palembang Konsultansi,2115000,0,-15.00%,0,2,24,2018,Bandung,Startup


In [10]:
## Your Code Below:
clients[['Consulting Sales','Software Sales','Returns']] = clients[['Consulting Sales','Software Sales','Returns']].astype('int64')
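As a rough sketch, the three cleaning steps (drop `IDR`, drop the comma, convert to `int64`) can also be chained per column and the result checked with `.dtypes`. The `sample` frame and the `money_cols` list below are hypothetical names with a few illustrative rows, not the full dataset.

```python
import pandas as pd

# Hypothetical two-row sample in the same raw format as the quiz data
sample = pd.DataFrame({
    "Consulting Sales": ["IDR7125000", "0"],
    "Software Sales": ["IDR5500000", "IDR550403"],
    "Returns": ["IDR1,500,000", "0"],
})
money_cols = ["Consulting Sales", "Software Sales", "Returns"]

for col in money_cols:
    sample[col] = (
        sample[col]
        .str.replace("IDR", "", regex=False)  # drop the currency prefix
        .str.replace(",", "", regex=False)    # drop the thousands separator
        .astype("int64")                      # convert the clean digits
    )

print(sample.dtypes)
```

Checking `.dtypes` after the conversion is a quick way to confirm that every monetary column is now numeric before doing arithmetic on it.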

---

1. Create a new column in the DataFrame and name it `Total Sales`. This column is a sum of `Consulting Sales` and `Software Sales`. Use `head` or `tail` to peek at the resulting data frame to confirm that the output matches your expectation. What is the sum of the `Total Sales` column?  

    *You are asked to find the overall sales by adding up the Consulting Sales and Software Sales of each company. Create a new column named Total Sales that stores the sum of the two values. What is the overall total (`sum`) of `Total Sales`? Tip: use the `.sum()` method on the column to add up its values!*

    - [ ] 11,470,000
    - [x] 19,238,903
    - [ ]  7,768,903

In [11]:
clients['Total Sales'] = clients['Consulting Sales'] + clients['Software Sales']
clients['Total Sales'] = clients['Total Sales'].astype('int64')

In [12]:
clients.head()

ID,Customer Name,Consulting Sales,Software Sales,Forecasted Growth,Returns,Month,Day,Year,Location,Account,Total Sales
30940,New Media Group,7125000,5500000,30.00%,1500000,1,10,2017,Jakarta,Enterprise,12625000
82391,Li and Partners,420000,820000,10.00%,400000,6,15,2016,Jakarta,Startup,1240000
18374,PT. Kreasi Metrik Solusi,0,550403,25.00%,0,3,29,2012,Surabaya,Enterprise,550403
57531,PT. Algoritma Data Indonesia,850000,395500,4.00%,0,7,17,2017,Jakarta,Startup,1245500
19002,Palembang Konsultansi,2115000,0,-15.00%,0,2,24,2018,Bandung,Startup,2115000


In [13]:
clients['Total Sales'].sum()

19238903

2. Which company has the biggest `Total Sales` in 2017?  

    *Which company earned the largest Total Sales in 2017? Use the subsetting methods you have learned!*
    
    - [X] New Media Group
    - [ ] PT. Algoritma Data Indonesia
    - [ ] Palembang Konsultansi
    

In [17]:
sales_2017 = clients[clients['Year'] == 2017]  # restrict to 2017 before taking the maximum
sales_2017[sales_2017['Total Sales'] == sales_2017['Total Sales'].max()]

ID,Customer Name,Consulting Sales,Software Sales,Forecasted Growth,Returns,Month,Day,Year,Location,Account,Total Sales
30940,New Media Group,7125000,5500000,30.00%,1500000,1,10,2017,Jakarta,Enterprise,12625000
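Comparing against the overall maximum of `Total Sales` only finds the 2017 leader when the largest sale happens to fall in 2017. A minimal sketch of the difference, using a hypothetical three-row `toy` frame in which the overall maximum belongs to another year:

```python
import pandas as pd

# Hypothetical data: the overall maximum (500) is in 2018, not 2017
toy = pd.DataFrame(
    {
        "Customer Name": ["A", "B", "C"],
        "Year": [2017, 2018, 2017],
        "Total Sales": [100, 500, 300],
    },
    index=pd.Index([1, 2, 3], name="ID"),
)

# Filter to the year first, then look up the row label of the largest value
in_2017 = toy[toy["Year"] == 2017]
top_id = in_2017["Total Sales"].idxmax()
top_name = in_2017.loc[top_id, "Customer Name"]

# Comparing against the overall maximum returns no rows for this data
global_max_rows = toy[
    (toy["Year"] == 2017) & (toy["Total Sales"] == toy["Total Sales"].max())
]

print(top_name)
print(len(global_max_rows))
```

On the quiz data both approaches give the same answer, because New Media Group holds the largest `Total Sales` overall and its record is from 2017.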


3. Which two companies have `Total Sales` exceeding 1,500,000 IDR in the sampled data frame?  

    *Two companies in the sample data have sales exceeding 1,500,000 IDR. Which companies are they?*

    - [ ] Palembang Konsultansi & PT. Surya Citra Manajemen
    - [ ] PT. Surya Citra Manajemen & New Media Group
    - [x] Palembang Konsultansi & New Media Group
    

In [20]:
clients.loc[clients['Total Sales'] > 1500000, ['Customer Name', 'Total Sales']]

ID,Customer Name,Total Sales
30940,New Media Group,12625000
19002,Palembang Konsultansi,2115000


4. The median is less sensitive to outliers than the mean, which makes it a simple way to find the central value of sample data that contains extreme values. Using this outlier-resistant measure on `Total Sales`, what is its central value?  

    *When determining the average or center of the data, the median is often more relevant than the mean, because the mean is more easily affected by extreme values or outliers. If we do not want the center of `Total Sales` to be affected by outliers, what central value should we use?*
    
    - [x] 1,354,250
    - [ ] 1,515,875
    - [ ] 3,737,700

In [66]:
clients['Total Sales'].median()

1354250.0
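To see how much a single large value pulls the mean, the two measures can be compared side by side. This sketch copies the six `Total Sales` values shown in the outputs of this notebook into a standalone Series; the name `total_sales` is only for illustration.

```python
import pandas as pd

# The six Total Sales values from the quiz data, copied by hand
total_sales = pd.Series([12625000, 1240000, 550403, 1245500, 2115000, 1463000])

mean_value = total_sales.mean()      # pulled upward by the 12,625,000 outlier
median_value = total_sales.median()  # midpoint of the two middle values

print(f"mean:   {mean_value:,.2f}")
print(f"median: {median_value:,.2f}")
```

With an even number of rows, the median is the average of the two middle values once sorted, here 1,245,500 and 1,463,000.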

5. If we want to perform subsetting on `clients` by explicitly stating the `ID`, which subsetting method is more appropriate?  

    *If we want to subset the `clients` data by explicitly stating a company's ID, which subsetting method is the most appropriate?*

    - [X] `clients.loc[57531, :]`
    - [ ] `clients.iloc[57531, : ]`
    - [ ] `clients[57531, : ]` 

In [54]:
clients.loc[57531, :]

Customer Name        PT. Algoritma Data Indonesia
Consulting Sales                           850000
Software Sales                             395500
Forecasted Growth                           4.00%
Returns                                         0
Month                                           7
Day                                            17
Year                                         2017
Location                                  Jakarta
Account                                   Startup
Total Sales                               1245500
Name: 57531, dtype: object

In [89]:
clients[57531, : ]

TypeError: '(57531, slice(None, None, None))' is an invalid key
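The distinction between label-based and position-based lookup can be sketched on a hypothetical two-row `toy` frame indexed by `ID`. `.loc` matches the index label, while `.iloc` counts rows from zero, so an ID such as 57531 is far out of range as a position.

```python
import pandas as pd

# Hypothetical two-row frame indexed by ID, mirroring the structure of `clients`
toy = pd.DataFrame(
    {"Customer Name": ["New Media Group", "PT. Algoritma Data Indonesia"]},
    index=pd.Index([30940, 57531], name="ID"),
)

by_label = toy.loc[57531, "Customer Name"]  # looks up the index label 57531
by_position = toy.iloc[1, 0]                # second row, first column

# Using the ID as a position fails: the frame has only two rows
try:
    toy.iloc[57531, :]
    iloc_failed = False
except IndexError:
    iloc_failed = True

print(by_label, by_position, iloc_failed)
```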

6. Say we need to find clients with an Enterprise account that are located in Jakarta. Fill in the blank to perform the right conditional subsetting: `clients[________ _ ________]`:

    *If we want to get the data for clients with an Enterprise account located in Jakarta, the subset syntax we would use is clients[________ _ ________]. (Fill in the value inside the square brackets!)*

    - [ ] (clients.Location == "Jakarta") | (clients.Account == "Enterprise")
    - [ ] clients.Location == "Jakarta" & clients.Account == "Enterprise"
    - [x] (clients.Location == "Jakarta") & (clients.Account == "Enterprise")
    - [ ] clients.Location == "Jakarta" | clients.Account == "Enterprise"

In [59]:
clients[(clients.Location == "Jakarta") & (clients.Account == "Enterprise")]

ID,Customer Name,Consulting Sales,Software Sales,Forecasted Growth,Returns,Month,Day,Year,Location,Account,Total Sales
30940,New Media Group,7125000,5500000,30.00%,1500000,1,10,2017,Jakarta,Enterprise,12625000
31142,PT. Surya Citra Manajemen,960000,503000,19.00%,0,1,19,2019,Jakarta,Enterprise,1463000
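The parentheses matter because `&` and `|` bind more tightly than `==` in Python, so an unparenthesized condition is evaluated as `"Jakarta" & clients.Account` first. A small sketch on a hypothetical three-row `toy` frame; the exact exception type for the unparenthesized form may differ between pandas versions, so the sketch catches both likely ones.

```python
import pandas as pd

# Hypothetical rows covering each combination of interest
toy = pd.DataFrame({
    "Location": ["Jakarta", "Jakarta", "Bandung"],
    "Account": ["Enterprise", "Startup", "Enterprise"],
})

# Without parentheses, `&` is applied before `==`, and the expression errors out
try:
    toy[toy.Location == "Jakarta" & toy.Account == "Enterprise"]
    unparenthesized_failed = False
except (TypeError, ValueError):
    unparenthesized_failed = True

# With parentheses, each comparison yields a boolean Series before combining
both = toy[(toy.Location == "Jakarta") & (toy.Account == "Enterprise")]    # AND
either = toy[(toy.Location == "Jakarta") | (toy.Account == "Enterprise")]  # OR

print(unparenthesized_failed, len(both), len(either))
```

The `|` version keeps every row that satisfies at least one condition, which is why it returns more rows than the question asks for.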
