# Light Semantic Data Transformations - Applied 

In [23]:
# Let's get our dependencies
import sys
!{sys.executable} -m pip install pandas
import pandas as pd
from datetime import datetime

# Some constants
DATA_FILE_PATH = "data_US.csv"
DATA_FILE_PATH_DAY2 = "data_US_day2.csv"



## Hidden Assumptions? ##

In [10]:
df = pd.read_csv(DATA_FILE_PATH)
print("This is our dataframe...\n", df.head(5))

This is our dataframe...
   SalesRep     LeadName
0    James  CompanyA_US
1    James  CompanyB_IT
2      Eve  CompanyC_NZ
3      Bob  CompanyD_US
4      Eve  CompanyE_IT


**Your task**: We pull a number of data sets from a CRM system. This one contains the sales representatives and 
the names of their leads, which the reps type in by hand, as you can see. The CRM team is pretty busy, so they are not able to add fields to the 
system.

*However, the sales reps came up with the idea of suffixing the lead names with the country of origin so that the leads can be grouped by country. Now they
are asking you to produce a simple report displaying the number of leads per rep per country. Let's get to it!*

In [21]:
## 1. Let us deduce the country of origin
df['country'] = df['LeadName'].apply(lambda x: x[-2:])

## 2. Let's do the grouping & aggregation
print(df.groupby(by=["SalesRep", "country"]).count())

                  LeadName
SalesRep country          
Bob      US              2
Eve      IT              1
         NZ              1
James    IT              3
         US              2


## The Problem? Let's take a look at the next day ##
A sales rep called "Tim" asks you what's up with the dashboard: something looks really weird with his numbers. He needs them fixed now, because a big sales report is coming up.

John is new on the data team and is looking into this (he did not build the pipeline above!).

In [25]:
df = pd.read_csv(DATA_FILE_PATH_DAY2)

## 1. Let us deduce the country of origin
df['country'] = df['LeadName'].apply(lambda x: x[-2:])

## 2. Let's do the grouping & aggregation
print(df.groupby(by=["SalesRep", "country"]).count())

                  LeadName
SalesRep country          
Bob      US              2
Eve      IT              1
         NZ              1
James    IT              3
         US              2
Tim      ER              1
         yJ              1


**Transparency?**: Now John is a bit confused; indeed, these don't look like "countries". John takes some time to 
find the code, fixes it, and is off. Problem fixed, right?

To fix it, John has to make the deduction logic a bit more complex: he splits on the underscore and handles leads without one separately (if he's 
really helpful, he uses "no_country_suffix_detected" as the value there).

**Implications**: Now the "country" logic becomes more and more complex. It sounds like an important piece of data, so it might be used in lots of analyses. What happens if the logic changes again? Maybe some sales reps decide to use lowercase suffixes and want that handled; maybe some sales reps forget the underscore. If it's not John who's looking at the next problem, things will take time and energy to fix, especially because it will be really hard to tell where the "country" actually comes from.

In [41]:
df = pd.read_csv(DATA_FILE_PATH_DAY2)

## 1. Let us deduce the country of origin
df['country'] = df['LeadName'].apply(lambda x: x[x.find('_')+1:] if x.find('_') !=-1 else "no_country_suffix_detected")

## 2. Let's do the grouping & aggregation
print(df.groupby(by=["SalesRep", "country"]).count())

                                     LeadName
SalesRep country                             
Bob      US                                 2
Eve      IT                                 1
         NZ                                 1
James    IT                                 3
         US                                 2
Tim      GER                                1
         no_country_suffix_detected         1
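
To make the implications above a bit more concrete, here is a minimal sketch of what the deduction logic could look like one iteration later, once lowercase suffixes also have to be handled. The function name `extract_country` and the sample lead names are hypothetical and not part of the original pipeline. Note that it splits on the *last* underscore, which is a design assumption that differs from John's `find('_')` version above.

```python
NO_SUFFIX = "no_country_suffix_detected"


def extract_country(lead_name: str) -> str:
    """Hypothetical helper: return the upper-cased text after the last underscore."""
    # rpartition returns ("", "", lead_name) when no underscore is present
    _, separator, suffix = lead_name.rpartition("_")
    if not separator or not suffix.strip():
        return NO_SUFFIX
    return suffix.strip().upper()


# Made-up lead names, for illustration only
for name in ["CompanyX_ger", "CompanyY", "Company_Z_nz"]:
    print(name, "->", extract_country(name))
```

A named function like this could be passed to `df['LeadName'].apply(extract_country)` in place of the lambda. It is easier to test, but the column would still be called "country", so the report would still not tell its readers where the value comes from.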


## How This Can Be Improved ##
Using the three key ideas you've learned in the first part, you might be able to improve the first pipeline. 

It might sound totally trivial, but you can increase transparency and reduce complexity simply by changing the naming. 

In [51]:
df = pd.read_csv(DATA_FILE_PATH_DAY2)

## 1. Let us deduce the "country_suffix"
df['Country Suffix (follows a \"_\")'] = df['LeadName'].apply(lambda x: x[-2:])

## 2. Let's do the grouping & aggregation
print(df.groupby(by=["SalesRep", "Country Suffix (follows a \"_\")"]).count())

                                         LeadName
SalesRep Country Suffix (follows a "_")          
Bob      US                                     2
Eve      IT                                     1
         NZ                                     1
James    IT                                     3
         US                                     2
Tim      ER                                     1
         yJ                                     1


**Let's fix the problem again**: This time, it takes John almost no time to deduce the problem, as the data display
is pretty clear. In fact, Tim might notice it himself and simply fix his problem on the CRM system's side.

Still, we're going to fix this...

In [59]:
df = pd.read_csv(DATA_FILE_PATH_DAY2)

## 1. Let us deduce the "country_suffix"
df['Country Suffix (follows a \"_\")'] = df['LeadName'].apply(lambda x: x[x.find('_')+1:] if x.find('_') !=-1 else "no underscore for suffix detected")

## 2. Let's do the grouping & aggregation
print(df.groupby(by=["SalesRep", "Country Suffix (follows a \"_\")"]).count())

                                            LeadName
SalesRep Country Suffix (follows a "_")             
Bob      US                                        2
Eve      IT                                        1
         NZ                                        1
James    IT                                        3
         US                                        2
Tim      GER                                       1
         no underscore for suffix detected         1


**The Benefits**: It's transparent and easy to fix for end users as well as devs. End users will have more ownership over the data they enter into 
the CRM system.

**Follow Up**: Once we go into the next iteration of the "Country Suffix" logic, things will become really complex. In that case,
it will make sense to follow up by making the transformation logic itself transparent, e.g. by using mapping tables and displaying them to 
the end user. This will decouple that logic, making it testable and easy to work on.
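
The mapping-table idea could be sketched as follows. This is only an illustration under stated assumptions: the table `suffix_to_country`, its rows, the sample leads, and the "unmapped suffix" label are all made up, and the suffix is taken as the text after the last underscore.

```python
import pandas as pd

# Hypothetical mapping table; this is what could be shown to (and maintained with) end users.
suffix_to_country = pd.DataFrame({
    "country_suffix": ["US", "IT", "NZ", "GER"],
    "country": ["United States", "Italy", "New Zealand", "Germany"],
})

# Made-up leads, for illustration only
leads = pd.DataFrame({
    "SalesRep": ["James", "Eve", "Tim", "Tim"],
    "LeadName": ["CompanyA_US", "CompanyC_NZ", "CompanyX_GER", "CompanyY"],
})

# Step 1: extract the raw suffix (NaN when there is no underscore)
leads["country_suffix"] = leads["LeadName"].str.extract(r"_([^_]+)$", expand=False)

# Step 2: look the suffix up in the mapping table instead of hard-coding rules
report_input = leads.merge(suffix_to_country, on="country_suffix", how="left")
report_input["country"] = report_input["country"].fillna("unmapped suffix")

# Step 3: the same grouping & aggregation as before
report = report_input.groupby(["SalesRep", "country"])["LeadName"].count()
print(report)
```

With this split, a new or misspelled suffix shows up as "unmapped suffix" in the report, and handling it means adding a row to the table rather than changing code.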