# Learn-by-Building

In [2]:
import pandas as pd

## Data Pre-Processing

The data you will read in is `companies.csv`, a small sample of a larger CRM (customer relationship management) dataset.

In [3]:
clients = pd.read_csv("data/companies.csv", index_col=0)
clients.head()

| ID | Customer Name | Consulting Sales | Software Sales | Forecasted Growth | Returns | Month | Day | Year | Location | Account |
|---|---|---|---|---|---|---|---|---|---|---|
| 30940 | New Media Group | IDR7125000 | IDR5500000 | 30.00% | IDR1,500,000 | 1 | 10 | 2017 | Jakarta | Enterprise |
| 82391 | Li and Partners | IDR420000 | IDR820000 | 10.00% | IDR400,000 | 6 | 15 | 2016 | Jakarta | Startup |
| 18374 | PT. Kreasi Metrik Solusi | 0 | IDR550403 | 25.00% | 0 | 3 | 29 | 2012 | Surabaya | Enterprise |
| 57531 | PT. Algoritma Data Indonesia | IDR850000 | IDR395500 | 4.00% | 0 | 7 | 17 | 2017 | Jakarta | Startup |
| 19002 | Palembang Konsultansi | IDR2115000 | 0 | -15.00% | 0 | 2 | 24 | 2018 | Bandung | Startup |


Unlike our previous datasets, `clients` has some formatting inconsistencies by design: the `Returns` column uses a comma as the thousands separator alongside the currency prefix (`IDR`), whereas the related sales columns omit the separator.

Now let's observe its data types:

In [6]:
clients.dtypes

Customer Name        object
Consulting Sales     object
Software Sales       object
Forecasted Growth    object
Returns              object
Month                 int64
Day                   int64
Year                  int64
Location             object
Account              object
dtype: object

Do you think they have been stored as the right types? Can you apply what you have learnt about specifying data types to this new data?

In [None]:
## Your Code Below:


If you tried to use the `.astype()` method directly on `Consulting Sales` and `Software Sales`, you will most likely get an error, because values such as `IDR7125000` cannot be parsed as numbers. To perform arithmetic on these columns, we have to drop the `IDR` currency prefix and treat the values as numbers. We'll use the `.replace()` method for this.
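To make the failure concrete, here is a minimal sketch on a three-value stand-in for the `Consulting Sales` column (the values are copied from the preview above; the variable names `sales`, `conversion_failed` and `cleaned` are made up for this illustration). It assumes the column holds strings with `object` dtype, as the `clients.dtypes` output suggests.

```python
import pandas as pd

# Stand-in for `Consulting Sales`: strings stored with object dtype
sales = pd.Series(["IDR7125000", "IDR420000", "0"],
                  name="Consulting Sales", dtype=object)

conversion_failed = False
try:
    # 'IDR7125000' is not a valid integer literal, so this raises ValueError
    sales.astype("int64")
except ValueError as err:
    conversion_failed = True
    print("Conversion failed:", err)

# Once the 'IDR' prefix is stripped, the same conversion goes through
cleaned = sales.replace("IDR", "", regex=True).astype("int64")
print(cleaned.tolist())
```

The order matters: strip the non-numeric text first, then convert the type.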

How do we apply `.replace()`? I'm going to show two ways of using it. If you're familiar with regular expression (regex) syntax, the following method might be convenient for you:

In [11]:
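# Columns that carry the 'IDR' prefix and/or comma separators
currency_cols = ['Consulting Sales', 'Software Sales', 'Returns']
# Raw string avoids an invalid-escape warning for \d; the pattern removes every non-digit character
clients[currency_cols] = clients[currency_cols].replace(
    r'[^\d]+', '',
    regex=True)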
clients.head()

| ID | Customer Name | Consulting Sales | Software Sales | Forecasted Growth | Returns | Month | Day | Year | Location | Account |
|---|---|---|---|---|---|---|---|---|---|---|
| 30940 | New Media Group | 7125000 | 5500000 | 30.00% | 1500000 | 1 | 10 | 2017 | Jakarta | Enterprise |
| 82391 | Li and Partners | 420000 | 820000 | 10.00% | 400000 | 6 | 15 | 2016 | Jakarta | Startup |
| 18374 | PT. Kreasi Metrik Solusi | 0 | 550403 | 25.00% | 0 | 3 | 29 | 2012 | Surabaya | Enterprise |
| 57531 | PT. Algoritma Data Indonesia | 850000 | 395500 | 4.00% | 0 | 7 | 17 | 2017 | Jakarta | Startup |
| 19002 | Palembang Konsultansi | 2115000 | 0 | -15.00% | 0 | 2 | 24 | 2018 | Bandung | Startup |
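To see what the pattern `[^\d]+` does to individual values, here is a small sketch using Python's standard `re` module. It assumes the pattern behaves the same way there as in the pandas call above; the variable names are made up for this illustration.

```python
import re

pattern = r"[^\d]+"  # one or more characters that are NOT digits

returns_clean = re.sub(pattern, "", "IDR1,500,000")  # prefix and commas removed
sales_clean = re.sub(pattern, "", "IDR420000")       # prefix removed
growth_mangled = re.sub(pattern, "", "-15.00%")      # sign, dot and % removed too

print(returns_clean)   # 1500000
print(sales_clean)     # 420000
print(growth_mangled)  # 1500
```

The last line shows why the pattern is only applied to the three currency columns: on a value like `-15.00%` it would also strip the minus sign and the decimal point, silently changing the number.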


Another way is to call `.apply(our_function)` on our `DataFrame`. What's interesting is that `our_function` could be any of Python's built-in functions, a function from a third-party module, or even a list of functions. For example, if we want to get both the maximum and minimum values of `Month`, `Day` and `Year`:


In [12]:
clients[['Month', 'Day', 'Year']].apply([max, min])

| | Month | Day | Year |
|---|---|---|---|
| max | 7 | 29 | 2019 |
| min | 1 | 10 | 2012 |


Back to removing the currency string from `Consulting Sales` and `Software Sales` using `.apply()` and `.replace()`. We could create our own function, name it `removeIDR` for example, and then apply it the following way:

`clients['Consulting Sales'].apply(removeIDR)`
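The text names `removeIDR` but does not define it, so the body below is only an assumption about what such a function might look like. It is shown on a small stand-in Series (`consulting`, a made-up name) rather than on `clients` itself.

```python
import pandas as pd

def removeIDR(value):
    """Hypothetical helper: strip the 'IDR' prefix from a single cell value."""
    # str() guards against cells that are not already strings
    return str(value).replace("IDR", "")

# Stand-in for clients['Consulting Sales']
consulting = pd.Series(["IDR7125000", "IDR420000", "0"], name="Consulting Sales")

# .apply() calls removeIDR once per value and collects the results
stripped = consulting.apply(removeIDR)
print(stripped.tolist())  # ['7125000', '420000', '0']
```

Note that the results are still strings; the function only removes the prefix and leaves type conversion as a separate step.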

Writing functions is a topic better suited to a later lesson; students new to programming in Python will be introduced to it gradually.

However, given the task at hand, this seems like a reasonable time to introduce **lambdas**: small anonymous functions written inline, so we can pass one to `.apply()` without defining a named function first.

In [13]:
clients['Consulting Sales'].replace(r'[^\d.]+', '', regex=True)

ID
30940    7125000
82391     420000
18374          0
57531     850000
19002    2115000
31142     960000
Name: Consulting Sales, dtype: object

In [14]:
clients['Consulting Sales'] = clients['Consulting Sales'].apply(lambda x: x.replace('IDR', ''))
clients['Software Sales'] = clients['Software Sales'].apply(lambda x: x.replace('IDR', ''))
clients.head()

| ID | Customer Name | Consulting Sales | Software Sales | Forecasted Growth | Returns | Month | Day | Year | Location | Account |
|---|---|---|---|---|---|---|---|---|---|---|
| 30940 | New Media Group | 7125000 | 5500000 | 30.00% | 1500000 | 1 | 10 | 2017 | Jakarta | Enterprise |
| 82391 | Li and Partners | 420000 | 820000 | 10.00% | 400000 | 6 | 15 | 2016 | Jakarta | Startup |
| 18374 | PT. Kreasi Metrik Solusi | 0 | 550403 | 25.00% | 0 | 3 | 29 | 2012 | Surabaya | Enterprise |
| 57531 | PT. Algoritma Data Indonesia | 850000 | 395500 | 4.00% | 0 | 7 | 17 | 2017 | Jakarta | Startup |
| 19002 | Palembang Konsultansi | 2115000 | 0 | -15.00% | 0 | 2 | 24 | 2018 | Bandung | Startup |
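Removing the prefix does not change the data type: as the `dtype: object` line in the earlier Series output shows, the values are still strings. The sketch below illustrates why that matters, using a two-row stand-in for `clients` (the names `sample`, `as_text`, `numeric` and `as_numbers` are made up for this illustration).

```python
import pandas as pd

# Two rows of already-cleaned values, still stored as strings (object dtype)
sample = pd.DataFrame(
    {"Consulting Sales": ["7125000", "420000"],
     "Software Sales": ["5500000", "820000"]},
    index=pd.Index([30940, 82391], name="ID"),
    dtype=object,
)

# Adding string columns concatenates the text instead of summing
as_text = sample["Consulting Sales"] + sample["Software Sales"]
print(as_text.tolist())  # ['71250005500000', '420000820000']

# After converting both columns to integers, + is real arithmetic
numeric = sample.astype({"Consulting Sales": "int64", "Software Sales": "int64"})
as_numbers = numeric["Consulting Sales"] + numeric["Software Sales"]
print(as_numbers.tolist())  # [12625000, 1240000]
```

Checking `.dtypes` again after the conversion is a quick way to confirm the columns are ready for the exercises that follow.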


1. Create a new column in the DataFrame and name it `Total Sales`. This column is a sum of `Consulting Sales` and `Software Sales`. Use `head` or `tail` to peek at the resulting data frame to confirm that the output matches your expectation. What is the sum of the `Total Sales` column?
    - [ ] 11,470,000
    - [ ] 19,238,903
    - [ ]  7,768,903


2. Which company has the biggest `Total Sales` in 2017?
    - [ ] New Media Group
    - [ ] PT. Algoritma Data Indonesia
    - [ ] Palembang Konsultansi
    

3. Which two companies have sales exceeding 1,500,000 IDR in the sampled data frame?
    - [ ] Palembang Konsultansi & PT. Surya Citra Manajemen
    - [ ] PT. Surya Citra Manajemen & New Media Group
    - [ ] Palembang Konsultansi & New Media Group
    
    
4. The simplest way to reduce the influence of outliers when finding the central value of sample data is to use the median instead of the mean. Ignoring the outliers of `Total Sales`, what is its central value?
    - [ ] 1,354,250
    - [ ] 1,515,875
    - [ ] 3,737,700
    
5. If we want to perform subsetting on `clients` by explicitly stating the `ID`, which subsetting method is more appropriate?
    - [ ] `clients.loc[57531, ]`
    - [ ] `clients.iloc[57531, ]`
    - [ ] `clients[57531, ]` 