# Week 6 - Data Cleansing & Manipulation
---

Mentoring Session - Job Preparation Program - Pacmann AI

In [1]:
# Please load this library
import numpy as np
import pandas as pd
import os

# Task 1. Merge Transactions Data Across Branches

---

## Task Descriptions
---


- Toko Serba Ada has several branches across the country.
- The Toko Serba Ada manager wants to merge the transaction data across branches.
- Your task is to create a function to join multiple transaction files.
- Download the transaction files [here](https://drive.google.com/drive/folders/1bJ5EWEHwx3xXlSLVyUYjjK2D6v_br-hb?usp=sharing).

## Detail function
---

- Create a function called `import_data`.
- The function needs only one input, `filenames` (`list`), a list of transaction data files.
- The `import_data` function concatenates every file listed in `filenames` into a single pandas DataFrame.

## Examples
---

**Input**

```python
# Define the input
filenames = [
    'branch_A.xlsx',
    'branch_B.csv',
    'branch_C.csv'
]

# Import data
data = import_data(filenames = filenames)

# Validate the result
print('Data shape:', data.shape)
data.head(5)
```

**Output**
```
Data shape: (1000, 17)
```
<img src="https://drive.google.com/uc?id=10VjyzDyInVbeqb6E5a0AlnU5DuZCx3ef" />

## Answer
---


- Provide the code for solving the problem
- **Make sure your function follows the `Detail Function`**

In [163]:
# Write your function in here
def import_data(filenames):
    """
        Concatenate the rows of every file listed in `filenames`.
        Assumes all files share the same column names in the same order.
    """
    # Collect one DataFrame per file, then concatenate them row-wise
    l_data = []
    if filenames:
        for item in filenames:
            data = read_data(item)
            l_data.append(data)
        data = pd.concat(l_data, axis=0, ignore_index=True)
        return data
    # No filenames given: return an empty DataFrame so the return type stays consistent
    return pd.DataFrame()


def read_data(filename):
    # Pick the reader based on the file extension (raises ValueError if unsupported)
    _, file_extension = os.path.splitext(filename)
    r_data = check_format(file_extension, filename)
    if r_data.empty:
        print(f"No rows found in : {filename}")
    return r_data

def check_format(f_extension,filename):
    # Dictionary of functions to read files based on extension
    s_format = {
        ".csv": lambda: pd.read_csv(filename, encoding='utf8', sep=';'),
        ".xlsx": lambda: pd.read_excel(filename)
    }
    
    # Check if the extension is supported
    if f_extension in s_format:
        return s_format[f_extension]()
    else:
        raise ValueError(f"Unsupported file extension: {f_extension}")


In [166]:
# Define the input
filenames = [
    'branch_A.xlsx',
    'branch_B.csv',
    'branch_C.csv'
]

# Import data
data = import_data(filenames = filenames)

# Validate the result
print('Data shape:', data.shape)
data.head(5)

Data shape: (1000, 17)


Unnamed: 0,Invoice ID,Branch,City,Customer type,Gender,Product line,Unit price,Quantity,Tax 5%,Total,Date,Time,Payment,cogs,gross margin percentage,gross income,Rating
0,750-67-8428,A,Yangon,Member,Female,Health and beauty,74.69,7,26.1415,548.9715,1/5/2019,13:08,Ewallet,522.83,4.761905,26.1415,9.1
1,631-41-3108,A,Yangon,Normal,Male,Home and lifestyle,46.33,7,16.2155,340.5255,3/3/2019,13:23,Credit card,324.31,4.761905,16.2155,7.4
2,123-19-1176,A,Yangon,Member,Male,Health and beauty,58.22,8,23.288,489.048,1/27/2019,20:33,Ewallet,465.76,4.761905,23.288,8.4
3,373-73-7910,A,Yangon,Normal,Male,Sports and travel,86.31,7,30.2085,634.3785,2/8/2019,10:37,Ewallet,604.17,4.761905,30.2085,5.3
4,355-53-5943,A,Yangon,Member,Female,Electronic accessories,68.84,6,20.652,433.692,2/25/2019,14:36,Ewallet,413.04,4.761905,20.652,5.8


# Task 2. Get the Unwatched Movie

---

## Task Descriptions
---

- You are a data analyst in the movie industry.
- The product team asks you to recommend something new for a user to watch.
- A simple approach is to recommend the movies that a specific user ID has not watched yet.
- To display the unwatched movies nicely on the website, the engineering team needs you to return 3 things:
  - `movieId`
  - `title`
  - `genres`
- Your task is to **create a function** to return the unwatched movies from a specific user id based on engineering team requirements.
- You can download the dataset [here](https://drive.google.com/drive/folders/1HSa7KStIlOS7rXY5ykwGZR6l9P-AJrKj?usp=sharing).
  - `ratings.csv` contains the user activity after watching movies, i.e. the rating each user gave to each movie they watched.
  - `movies.csv` contains the movie metadata (movie ID, title, and genre)
- The dataset originally comes from **MovieLens**

## Detail function
---

- Create a function called `get_unwatched_movie`
- The function needs two inputs:
  - `userId` (`int`): The targeted user ID
  - `config` (`dict`): The configuration in which the engineering team stores the paths to the user data and the movie metadata. Example:

  ```python
  config = {
      'path': {
          'user_data': 'ratings.csv',
          'metadata': 'movies.csv'
      }
  }
  ```

- The function returns a pandas DataFrame with `movieId` as the index and two columns, `title` and `genres`.

## Examples
---

**Define the Configuration Variable**

```python
# Define CONFIG variable
CONFIG = {
    'path': {
        'user_data': 'ratings.csv',
        'metadata': 'movies.csv'
    }
}
```

---
**Input 1**

```python
# Find the unwatched data for userId = 3
unwatched_data = get_unwatched_movie(userId = 3,
                                     config = CONFIG)

print('Data shape:', unwatched_data.shape)
unwatched_data.sample(n=5, random_state=42)
```

**Output 1**
```
Data shape: (9703, 2)
```
<img src="https://drive.google.com/uc?id=18R0Ym9NplzBnu12hBU10DR8tgFBiQhp6"/>

---
**Input 2**

```python
# Find the unwatched data for userId = 10
unwatched_data = get_unwatched_movie(userId = 10,
                                     config = CONFIG)

print('Data shape:', unwatched_data.shape)
unwatched_data.sample(n=5, random_state=42)
```

**Output 2**
```
Data shape: (9602, 2)
```
<img src="https://drive.google.com/uc?id=1m8igXpZ5zS75ioV1tT8gIpLvtrh7TdrK"/>

---
**Input 3**

```python
# Find the unwatched data for userId = 3
unwatched_data = get_unwatched_movie(userId = 3,
                                     config = CONFIG)

print('Data shape:', unwatched_data.shape)
unwatched_data.sample(n=5, random_state=42)
```

**Output 3**
```
Data shape: (9402, 2)
```
<img src="https://drive.google.com/uc?id=1R-BLxcY8Bf3XUxB2Ikf95wafj_1iRNCg"/>

## Answer
---


- Provide the code for solving the problem
- **Make sure your function follows the `Detail Function`**

In [233]:
# Write your function in here
def get_unwatched_movie(userId,config):
    try:
        if userId:
            df = read_data(config)
            ratings = df['user_data']
            filter_user = ratings[ratings['userId'] == userId]
            merge_movies = df['metadata'].merge(filter_user,how='left',on='movieId')
            new_df = merge_movies[merge_movies['userId'].isnull()]
            unwatched_movies = transform_df(new_df)
            return unwatched_movies
    except Exception as err:
        raise ValueError(str(err))
        
        
def read_data(config):
    try:    
        path_data = {}
        data = config['path']
        for key,filename in data.items():
            _,file_extension= os.path.splitext(filename)
            path_data[key]= check_format(file_extension,filename)    
        if path_data:
            return path_data

        print(f"Extension is not found : {file_extension}")

    except Exception as err:
        raise ValueError(str(err))
        
def check_format(f_extension,filename):
    try:    
        # Dictionary of functions to read files based on extension
        s_format = {
            ".csv": lambda: pd.read_csv(filename),
            ".xlsx": lambda: pd.read_excel(filename)
        }

        # Check if the extension is supported
        if f_extension in s_format:
            return s_format[f_extension]()
        else:
            raise ValueError(f"Unsupported file extension: {f_extension}")
    except Exception as err:
        raise ValueError(str(err))

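def transform_df(df):
    try:
        # Select the columns first, then set the index on the new frame;
        # set_index(inplace=True) on a slice triggers SettingWithCopyWarning
        data = df[["movieId","title","genres"]].set_index("movieId")
        return data
    except Exception as err:
        raise ValueError(str(err))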


In [237]:
# Define CONFIG variable
CONFIG = {
    'path': {
        'user_data': 'ratings.csv',
        'metadata': 'movies.csv'
    }
}

# Find the unwatched data for userId = 10
unwatched_data = get_unwatched_movie(userId = 10,
                                     config = CONFIG)

print('Data shape:', unwatched_data.shape)
unwatched_data
unwatched_data.sample(n=5, random_state=42)

Data shape: (9602, 2)


movieId,title,genres
1797,Everest (1998),Documentary|IMAX
179819,Star Wars: The Last Jedi (2017),Action|Adventure|Fantasy|Sci-Fi
77266,Disgrace (2008),Drama
26003,Night and Fog (Nuit et brouillard) (1955),Crime|Documentary|War
54686,"Last Legion, The (2007)",Action|Adventure|Fantasy|War


# Task 3. Get the House Recommendation

---

## Task Descriptions
---

- Assume you work as a Data Analyst at Travelio.
- The product team requests that you give its users housing recommendations based on their current location and housing preferences.
- Please create a function to answer the product team's request.
- You can find the dataset [here](https://drive.google.com/file/d/1D5phg8q0MiX4lRKlEaBWHT07MoEgEr28/view?usp=sharing).
- **Note**: The dataset was scraped by Pacmann from the Travelio website for educational purposes only.

## Detail function
---

- Create a function called `get_user_recommendation` that returns the top-$n$ housing recommendations for a specific user location and set of preferences, sorted by the distance between the user location and the house location (nearest first).
- The function needs three inputs:
    - `n` (`int`): the maximum number of recommendations.
    - `user_config` (`dict`): the user configuration data. It contains the user preferences and the user's current location.
    - `data_config` (`dict`): the data configuration that contains the housing data path.
- The output is a DataFrame with columns similar to those of the dataset.
---
- We filter using 5 preferences:
  - `property_type`. It should return `apartment` or `house`.
  - `size`. It should return houses that are **larger than or equal to** the given `size`.
  - `capacity`. It should return houses whose capacity is **more than or equal to** the given `capacity`.
  - `is_furnished`. It should return `Full Furnished` or `Unfurnished`.
  - `yearly_price`. It should return houses whose yearly rent is **less than or equal to** the given `yearly_price`.
- If the user leaves a preference empty (`None`), do not filter on that preference.
---
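
The filtering rules above could be sketched as follows. This is only a minimal sketch under assumptions: `filter_by_preferences` is a hypothetical helper name, and the dataset columns are assumed to carry the same names as the preference keys, which the task description does not confirm.

```python
import pandas as pd

def filter_by_preferences(df, preferences):
    """Keep the rows that satisfy every preference that is not None."""
    # Comparison applied per preference key (keys assumed to match column names)
    rules = {
        'property_type': lambda col, v: col == v,
        'size':          lambda col, v: col >= v,
        'capacity':      lambda col, v: col >= v,
        'is_furnished':  lambda col, v: col == v,
        'yearly_price':  lambda col, v: col <= v,
    }
    mask = pd.Series(True, index=df.index)
    for key, rule in rules.items():
        value = preferences.get(key)
        if value is not None:  # None means "do not filter on this preference"
            mask &= rule(df[key], value)
    return df[mask]

# Made-up rows, only to illustrate the behaviour
houses = pd.DataFrame({
    'property_type': ['apartment', 'house', 'apartment'],
    'size': [28.0, 60.0, 35.0],
    'capacity': [2, 4, 2],
    'is_furnished': ['Full Furnished', 'Unfurnished', 'Full Furnished'],
    'yearly_price': [40_000_000, 90_000_000, 45_000_000],
})
prefs = {
    'property_type': None,
    'size': 30.0,
    'capacity': 2,
    'is_furnished': 'Full Furnished',
    'yearly_price': 50_000_000,
}
filtered = filter_by_preferences(houses, prefs)
print(filtered)
```

Keeping the rules in a dictionary means a preference set to `None` is skipped without a separate `if` branch per column.
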
- Please use the **Haversine** distance to calculate the distance between the user and each house.
- We are intentionally not giving you the Haversine distance formula. Please explore it yourself.
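
For reference, one common form of the Haversine distance is $d = 2R \arcsin\sqrt{\sin^2(\Delta\varphi/2) + \cos\varphi_1 \cos\varphi_2 \sin^2(\Delta\lambda/2)}$, with latitudes $\varphi$ and longitudes $\lambda$ in radians. A minimal sketch is shown below; the function name `haversine_km` and the Earth radius of 6371 km (the commonly used mean radius) are assumptions, not part of the task.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees."""
    # Convert degrees to radians (works for scalars and for pandas/NumPy arrays)
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Distance between the two example user locations (near Bintaro Plaza and near Monas)
d = haversine_km(-6.2734, 106.7364, -6.1792, 106.8265)
print(f"{d:.2f} km")
```

Because the function is built from NumPy operations, the house coordinates can be passed as whole DataFrame columns; the result can then be stored as a distance column, sorted in ascending order, and truncated with `head(n)`.
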

## Examples
---

---
**Input 1**

```python
# Define the user data
user_config = {
    'preferences': {
        'property_type': None,
        'size': 30.0,
        'capacity': 2,
        'is_furnished': 'Full Furnished',
        'yearly_price': 50000000
    },
    'location': {
        # Near Bintaro Plaza
        'latitude': -6.2734,
        'longitude': 106.7364
    }
}

data_config = {
    'path': 'travelio_dki_jakarta.csv'
}

# Run the function
user_recommendation = get_user_recommendation(n = 10,
                                              user_config = user_config,
                                              data_config = data_config)

# Validate
print('Data Shape:', user_recommendation.shape)
user_recommendation
```

**Output 1**
```
Data Shape: (10, 16)
```
<img src="https://drive.google.com/uc?id=1Ek8VjhgOqWh18T1zEvn0b5zZIKlMt1wG"/>

---
**Input 2**

```python
# Define the user data
user_config = {
    'preferences': {
        'property_type': None,
        'size': 45.0,
        'capacity': 4,
        'is_furnished': None,
        'yearly_price': 25000000
    },
    'location': {
        # Near the National Monument (Monas)
        'latitude': -6.1792,
        'longitude': 106.8265
    }
}

data_config = {
    'path': 'travelio_dki_jakarta.csv'
}

# Run the function
user_recommendation = get_user_recommendation(n = 10,
                                              user_config = user_config,
                                              data_config = data_config)

# Validate
print('Data Shape:', user_recommendation.shape)
user_recommendation
```

**Output 2**
```
Data Shape: (10, 16)
```
<img src="https://drive.google.com/uc?id=14eIe-BjfjTM53Y3m9ObdoQ2nKVWgUGiY"/>

---
**Input 3**

```python
# Define the user data
user_config = {
    'preferences': {
        'property_type': None,
        'size': 60.0,
        'capacity': 4,
        'is_furnished': None,
        'yearly_price': 25000000
    },
    'location': {
        # Near Kota Tua Jakarta
        'latitude': -6.1378,
        'longitude': 106.8144
    }
}

data_config = {
    'path': 'travelio_dki_jakarta.csv'
}

# Run the function
user_recommendation = get_user_recommendation(n = 10,
                                              user_config = user_config,
                                              data_config = data_config)

# Validate
print('Data Shape:', user_recommendation.shape)
user_recommendation
```

**Output 3**
```
Data Shape: (6, 16)
```
<img src="https://drive.google.com/uc?id=1WAjuLElzpxuECoh8ArhD2XeWEj1T3blk"/>

## Answer
---

- Provide the code for solving the problem
- **Make sure your function follows the `Detail Function`**

In [18]:
# Write your function in here
def get_user_recommendation(n,data_config):
    try:
        if n and data_config:
            df = read_data(data_config)
            return df
    except Exception as err:
        raise ValueError(str(err))
        
def read_data(config):
    try:    
        path_data = None
        for _,filename in config.items():
            _,file_extension= os.path.splitext(filename)
            path_data= check_format(file_extension,filename)    
        # `if path_data:` is ambiguous for a DataFrame, so compare against None
        if path_data is not None:
            return path_data

        print("No file path found in the configuration")

    except Exception as err:
        raise ValueError(str(err))
        
def check_format(f_extension,filename):
    try:    
        # Dictionary of functions to read files based on extension
        print(f_extension)
        print(filename)

        s_format = {
            ".csv": lambda: pd.read_csv(filename),
            ".xlsx": lambda: pd.read_excel(filename)
        }

        # Check if the extension is supported
        if f_extension in s_format:
            return s_format[f_extension]()
        else:
            raise ValueError(f"Unsupported file extension: {f_extension}")
    except Exception as err:
        raise ValueError(str(err))
        
        
data_config = {
    'path': 'travelio_dki_jakarta.csv'
}

# Run the function
user_recommendation = get_user_recommendation(n = 10,
                                              data_config = data_config)



.csv
travelio_dki_jakarta.csv


ValueError: [Errno 2] File b'travelio_dki_jakarta.csv' does not exist: b'travelio_dki_jakarta.csv'

In [17]:
d = pd.read_csv("travelio_dki_jakarta.csv")

FileNotFoundError: [Errno 2] File b'travelio_dki_jakarta.csv' does not exist: b'travelio_dki_jakarta.csv'

# Task 4. Export the Promising State

---

## Task Descriptions
---


- Assume you are a data analyst at Amazon.
- Your supervisor asks you to export the sales data of each promising state, selected by market share, to a `.csv` file so that each state representative can analyze the sales data further.
- A promising state is a state whose market share is greater than or equal to a specified threshold.
- The market share of a state is defined as the number of orders in that state divided by the total number of orders.
- Write a function to help your supervisor!
- Download your data [here](https://drive.google.com/file/d/1oRAPo7ZST2i_pHAIWP2_KoLraniUwyME/view?usp=sharing).
- The actual data source is [here](https://www.kaggle.com/datasets/thedevastator/unlock-profits-with-e-commerce-sales-data?select=Amazon+Sale+Report.csv).

## Detail function
---

- Create a function called `export_promising_state`.
- This function needs two inputs:
  - `config_file` (`dict`) contains the input and output paths.
  - `thresh` (`float`) contains the given market share threshold.
- This function returns nothing.
- If you cannot find any promising state based on the given threshold, then print `No promising state`.
- If you find promising states,
  - First, drop the columns `index` and `Unnamed: 22` from the promising data.
  - Save the promising data with the format `folder_path` + `state-name` + `-sales-reports.csv`, e.g. `sales_data/telangana-sales-reports.csv`.
  - After successfully exporting the data, print a message that includes the state market share and the shape of the state sales data.
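
The steps above might be sketched roughly as follows. This is not a reference solution: the state column name `ship-state` is an assumption about the dataset (exposed here as the hypothetical `state_col` parameter), and lowercasing the state names is inferred from the lowercase names in the expected output.

```python
import os
import pandas as pd

def export_promising_state(config_file, thresh, state_col='ship-state'):
    """Export one CSV per state whose market share is at least `thresh`."""
    df = pd.read_csv(config_file['path']['input'], low_memory=False)

    # Normalise names so that e.g. 'Kerala' and 'KERALA' count as one state
    state = df[state_col].str.strip().str.lower()

    # Market share = number of orders in the state / total number of orders
    share = state.value_counts() / len(df)
    promising = share[share >= thresh]
    if promising.empty:
        print('No promising state')
        return

    out_dir = config_file['path']['output']
    os.makedirs(out_dir, exist_ok=True)
    for name, pct in promising.items():
        # errors='ignore' keeps the sketch usable if a column is absent
        data = df[state == name].drop(columns=['index', 'Unnamed: 22'],
                                      errors='ignore')
        path = f'{out_dir}{name}-sales-reports.csv'
        data.to_csv(path, index=False)
        print(f'Data of state "{name}" was successfully exported into "{path}"')
        print(f'  - State market share : {pct * 100:.2f} %')
        print(f'  - Data shape         : {data.shape}\n')
```

It would be called exactly as in the examples below, e.g. `export_promising_state(config_file=config_file, thresh=0.10)`.
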

## Examples
---

**Define the Configuration Variable**

```python
# Define CONFIG variable
config_file = {
    'path': {
        'input': 'Amazon Sale Report.csv',
        'output': 'sales_data/'
    }
}
```

---
**Input 1**

```python
# Input 1
export_promising_state(config_file = config_file,
                       thresh = 0.10)
```

**Output 1**
```
Data of state "karnataka" was successfully exported into "sales_data/karnataka-sales-reports.csv"
  - State market share : 13.43 %
  - Data shape         : (17326, 22)

Data of state "maharashtra" was successfully exported into "sales_data/maharashtra-sales-reports.csv"
  - State market share : 17.26 %
  - Data shape         : (22260, 22)
```

Example of the created files: <br>
<img src="https://drive.google.com/uc?id=1C1r8SKoRHbKX0upPl5VDrzuf4Mi0joiC"/>

---
**Input 2**

```python
# Input 2
export_promising_state(config_file = config_file,
                       thresh = 0.05)
```

**Output 2**
```
Data of state "telangana" was successfully exported into "sales_data/telangana-sales-reports.csv"
  - State market share : 8.78 %
  - Data shape         : (11330, 22)

Data of state "kerala" was successfully exported into "sales_data/kerala-sales-reports.csv"
  - State market share : 5.11 %
  - Data shape         : (6585, 22)

Data of state "delhi" was successfully exported into "sales_data/delhi-sales-reports.csv"
  - State market share : 5.40 %
  - Data shape         : (6967, 22)

Data of state "uttar pradesh" was successfully exported into "sales_data/uttar pradesh-sales-reports.csv"
  - State market share : 8.25 %
  - Data shape         : (10638, 22)

Data of state "karnataka" was successfully exported into "sales_data/karnataka-sales-reports.csv"
  - State market share : 13.43 %
  - Data shape         : (17326, 22)

Data of state "tamil nadu" was successfully exported into "sales_data/tamil nadu-sales-reports.csv"
  - State market share : 8.90 %
  - Data shape         : (11483, 22)

Data of state "maharashtra" was successfully exported into "sales_data/maharashtra-sales-reports.csv"
  - State market share : 17.26 %
  - Data shape         : (22260, 22)
```

Example of the created files: <br>
<img src="https://drive.google.com/uc?id=1ujeDK87N4MLk1_9Uew4lkIeFA81_DtDw"/>

---
**Input 3**

```python
# Input 3
export_promising_state(config_file = config_file,
                       thresh = 0.4)
```

**Output 3**
```
No promising state
```

## Answer
---


- Provide the code for solving the problem
- **Make sure your function follows the `Detail Function`**

In [None]:
# Write your function in here
# Psst.. You can build as many functions you need.
#        Just make sure the function is
#        - Modular
#        - Clean (easy to read & with docstring)
#        - Can be easily validated


# Task 5.  Clean the AirBnB data
---

## Task Descriptions
---


- You are given a dataset of AirBnB guests and hosts.
- This dataset contains the reviews given by guests (`id`) to listings.
- You want to analyze the reviews given by the guests by `neighborhood group`.
- But the data is not clean.
- Please clean the data by
  1. Dropping rows with missing values
  2. Removing the inconsistencies in `neighborhood group`
  3. Dropping the listing outliers/anomalies:
    - listings with an anomalous rent price (please use the IQR method to filter the outliers).
    - listings with an anomalous `availability 365` (`availability 365` is the total number of days the listing is available during the year)
  4. Dropping duplicate rows (if any)
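
The IQR filter mentioned in step 3 could look like the sketch below. The helper name `drop_iqr_outliers`, the column name `price`, and the multiplier `k=1.5` (the conventional Tukey fence) are assumptions made for illustration; the sample values are made up.

```python
import pandas as pd

def drop_iqr_outliers(df, column, k=1.5):
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # between() is inclusive on both ends by default
    return df[df[column].between(lower, upper)]

# Made-up prices with one obvious outlier
listings = pd.DataFrame({'price': [100, 110, 120, 130, 140, 5000]})
clean = drop_iqr_outliers(listings, 'price')
print(clean)
```

For `availability 365`, one possible reading of "anomaly" is any value outside the range 0 to 365 days, which could be filtered with the same `between` call.
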

**Dataset**
- Please download the dataset [here](https://drive.google.com/file/d/19zOwcAkd7lTC_djAMgc5u1B7I2iPj5ek/view?usp=sharing)
- The actual data source is [here](https://www.kaggle.com/datasets/arianazmoudeh/airbnbopendata)

## Expected Output
---


The output should be:
```
Clean data shape: (98174, 26)
```

With the sample of the clean data

<center>
<img src="https://drive.google.com/uc?id=1ZmVMW41RbxCuKUcafic5CL9E0JAemdNk">
</center>

## Answer
---


- Provide the code for getting the expected output

In [None]:
# Write your code in here
# Provide your code with sufficient comment on the wrangling processes

# Task 6. Calculate Month-Over-Month Percentage Change in Sales


---

## Task Descriptions
---


- Given a dataset of sales reports by date, **calculate the month-over-month percentage change in sales**.
- The output should include the year-month date (YYYY-MM) and the percentage change, rounded to 2 decimal places, and sorted by `order-date` in ascending order.
- The percentage change column will be populated from the 2nd month forward and can be calculated as
$$
\cfrac
{(\text{this month's sales - last month's sales})}
{\text{last month's sales}} \cdot 100\%
$$
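
The formula above maps directly onto `Series.pct_change` in pandas. The sketch below assumes columns named `order-date` and `sales` and uses made-up rows; the real dataset may use different column names or a different date format, in which case `pd.to_datetime` would need the matching arguments.

```python
import pandas as pd

def monthly_pct_change(df, date_col='order-date', sales_col='sales'):
    """Total sales per month plus the month-over-month percentage change."""
    # Truncate each date to its month, e.g. 2014-01-05 -> 2014-01
    month = pd.to_datetime(df[date_col]).dt.to_period('M')
    monthly = df.groupby(month)[sales_col].sum().sort_index().reset_index()
    # (this month - last month) / last month * 100; the first month stays NaN
    monthly['pct_change'] = (monthly[sales_col].pct_change() * 100).round(2)
    monthly[date_col] = monthly[date_col].astype(str)  # 'YYYY-MM' strings
    return monthly

# Made-up rows, only to illustrate the calculation
sales = pd.DataFrame({
    'order-date': ['2014-01-05', '2014-01-20', '2014-02-03', '2014-03-09'],
    'sales': [100.0, 100.0, 250.0, 200.0],
})
result = monthly_pct_change(sales)
print(result)
```

In this toy data, January totals 200 and February 250, so February's change is (250 - 200) / 200 · 100 = 25.0.
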

**Dataset**
- Please download the dataset in [here](https://drive.google.com/file/d/13QxDig8cXrT5ErVO2tytYdmksjbH3tep/view?usp=sharing)
- The actual data source is [here](https://www.kaggle.com/datasets/apoorvaappz/global-super-store-dataset)

## Expected Output
---


The output should be:

```
Data Shape: (48, 3)
```

The first 12 entries of the final data are

<center>
<img src="https://drive.google.com/uc?id=1Qcj_OUVOqfFCwbau0XUJK9URL3BrX86u">
</center>

## Answer
---


- Provide the code for getting the expected output

In [None]:
# Write your code in here
# Provide your code with sufficient comment on the wrangling processes

# Task 7. Time to Purchase Duration


---

## Task Descriptions
---


- You are given a dataset of e-commerce event history from an electronics shop.
- Your task is to
  1. **calculate each user's time to purchase duration** and
  2. **find the central tendency of the users' view-to-purchase duration** (please use a proper measure of central tendency).
- Time to purchase duration is defined as the time difference between the `view` event and the `purchase` event. Consider only the earliest view and the earliest purchase.
- The output should include `user_id` and their `view_purchase_duration` in minutes.
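
One way to approach this is sketched below. It assumes the columns are named `user_id`, `event_type`, and `event_time`, and it uses the median as the summary because durations are typically right-skewed; both are assumptions rather than requirements stated in the task. The rows are made up.

```python
import pandas as pd

def view_purchase_duration(events):
    """Minutes between each user's earliest view and earliest purchase."""
    events = events.assign(event_time=pd.to_datetime(events['event_time']))
    # Earliest timestamp per user and event type, one column per event type
    first = (
        events[events['event_type'].isin(['view', 'purchase'])]
        .groupby(['user_id', 'event_type'])['event_time'].min()
        .unstack('event_type')
    )
    # Keep only users who have both a view and a purchase
    first = first.dropna(subset=['view', 'purchase'])
    minutes = (first['purchase'] - first['view']).dt.total_seconds() / 60
    return minutes.rename('view_purchase_duration').reset_index()

# Made-up events: user 3 never purchases and is therefore dropped
events = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 3],
    'event_type': ['view', 'view', 'purchase', 'view', 'purchase', 'view'],
    'event_time': [
        '2020-01-01 10:00:00', '2020-01-01 10:02:00', '2020-01-01 10:05:00',
        '2020-01-01 09:00:00', '2020-01-01 09:30:00', '2020-01-01 08:00:00',
    ],
})
durations = view_purchase_duration(events)
median_minutes = durations['view_purchase_duration'].median()
print(durations)
print('Median duration (minutes):', median_minutes)
```

The sketch does not treat users whose first purchase precedes their first view in any special way; whether such rows should be kept is a decision the task leaves open.
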

**Dataset**
- Please download the dataset in [here](https://drive.google.com/file/d/1GfFkxIbAivdY8bqbFbARCiyTzIuooKvV/view?usp=sharing)
- The actual data source is [here](https://www.kaggle.com/datasets/mkechinov/ecommerce-events-history-in-electronics-store/data)

## Expected Output
---


The output should be:
```
Data shape : (7847, 2)
Summary of user's view to purchase duration : 4.7 minutes
```

With the sample of final data

<center>
<img src="https://drive.google.com/uc?id=159zrsppCuawgqn9q0e1B63KJzqe9Pvz9">
</center>

## Answer
---


- Provide the code for getting the expected output

In [None]:
# Write your code in here
# Provide your code with sufficient comment on the wrangling processes