<a href="https://colab.research.google.com/github/sseamonds/python/blob/master/churn/1_churn_feature_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <span style="font-width:bold; font-size: 3rem; color:#1EB182;"><img src="https://github.com/logicalclocks/hopsworks-tutorials/blob/master/images/icon102.png?raw=1" width="38px"></img> **Hopsworks Feature Store** </span><span style="font-width:bold; font-size: 3rem; color:#333;">- Part 01: Feature Pipeline</span>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/churn/1_churn_feature_pipeline.ipynb)


## 🗒️ This notebook is divided into the following sections:
1. Load the data and engineer features.
2. Connect to the Hopsworks feature store.
3. Create feature groups and upload them to the feature store.


![tutorial-flow](https://github.com/logicalclocks/hopsworks-tutorials/blob/master/images/01_featuregroups.png?raw=1)

First of all you will load the data and do some feature engineering on it.


The data you will use comes from three different CSV files:

- `demography.csv`: demographic information.
- `customer_info.csv`: customer information such as contract type, billing method, and monthly charges, as well as whether the customer has churned within the last month.
- `subscriptions.csv`: customer subscriptions to services such as internet, mobile, or movie streaming.

You can conceptualize these CSV files as originating from separate data sources.
**All three files share a customer id column, `customerID`, which you can use for joins.**
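
As a minimal sketch of such a join, the snippet below merges three tiny hand-made frames on `customerID`. The two rows are copied from the previews shown later in this notebook and reduced to one feature column each. The variable names (`demo`, `info`, `subs`) are illustrative only; in this tutorial series the actual join is done through the feature store in the next notebook.

```python
import pandas as pd

# Tiny stand-ins for the three CSV files (illustrative data only)
demo = pd.DataFrame(
    {"customerID": ["7590-VHVEG", "5575-GNVDE"], "gender": ["Female", "Male"]}
)
info = pd.DataFrame(
    {"customerID": ["7590-VHVEG", "5575-GNVDE"], "Contract": ["Month-to-month", "One year"]}
)
subs = pd.DataFrame(
    {"customerID": ["7590-VHVEG", "5575-GNVDE"], "InternetService": ["DSL", "DSL"]}
)

# Inner-join the three frames on the shared key
joined = demo.merge(info, on="customerID", how="inner").merge(
    subs, on="customerID", how="inner"
)
print(joined)
```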

Let's go ahead and load the data.

### <span style='color:#ff5f27'> 📝 Imports </span>

In [12]:
!pip uninstall -y pyspark
!pip install -U hopsworks[python] --quiet

Found existing installation: pyspark 3.5.3
Uninstalling pyspark-3.5.3:
  Successfully uninstalled pyspark-3.5.3

In [7]:
!pip show pandas

Name: pandas
Version: 2.2.2
Summary: Powerful data structures for data analysis, time series, and statistics
Home-page: https://pandas.pydata.org
Author: 
Author-email: The Pandas Development Team <pandas-dev@python.org>
License: BSD 3-Clause License

In [2]:
import pandas as pd

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

## <span style="color:#ff5f27;"> 💽 Loading the Data </span>


In [3]:
# Read demography data
demography_df = pd.read_csv("https://repo.hops.works/dev/davit/churn/demography.csv")

# Read customer info data with datetime parsing
customer_info_df = pd.read_csv(
    "https://repo.hops.works/dev/davit/churn/customer_info.csv",
    parse_dates=['datetime'],
)

# Read subscriptions data with datetime parsing
subscriptions_df = pd.read_csv(
    "https://repo.hops.works/dev/davit/churn/subscriptions.csv",
    parse_dates=['datetime'],
)
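
To see what `parse_dates` does without downloading anything, here is a small sketch that reads a two-row CSV from an in-memory string. The sample rows mirror the `customer_info.csv` preview below; `sample_csv` and `sample_df` are hypothetical names used only for this illustration.

```python
import io

import pandas as pd

# In-memory CSV that mimics a slice of customer_info.csv
sample_csv = io.StringIO(
    "customerID,MonthlyCharges,datetime\n"
    "7590-VHVEG,29.85,2021-10-25 15:07:18.625390512\n"
    "5575-GNVDE,56.95,2020-06-28 06:32:24.674808292\n"
)

# parse_dates asks pandas to convert the listed columns to datetime values
sample_df = pd.read_csv(sample_csv, parse_dates=["datetime"])

# Confirm that the column is a datetime type rather than plain strings
is_datetime = pd.api.types.is_datetime64_any_dtype(sample_df["datetime"])
print(sample_df.dtypes)
print("datetime column parsed:", is_datetime)
```

Having a real datetime column matters later, because two of the feature groups use `datetime` as their event time.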

In [4]:
demography_df.head(3)

Unnamed: 0,customerID,gender,SeniorCitizen,Dependents,Partner
0,7590-VHVEG,Female,0,No,Yes
1,5575-GNVDE,Male,0,No,No
2,3668-QPYBK,Male,0,No,No


In [5]:
customer_info_df.head(3)

Unnamed: 0,customerID,Contract,tenure,PaymentMethod,PaperlessBilling,MonthlyCharges,TotalCharges,Churn,datetime
0,7590-VHVEG,Month-to-month,1,Electronic check,Yes,29.85,29.85,No,2021-10-25 15:07:18.625390512
1,5575-GNVDE,One year,34,Mailed check,No,56.95,1889.5,No,2020-06-28 06:32:24.674808292
2,3668-QPYBK,Month-to-month,2,Mailed check,Yes,53.85,108.15,Yes,2021-12-05 20:10:58.449304176


In [6]:
subscriptions_df.head(3)

Unnamed: 0,customerID,DeviceProtection,OnlineBackup,OnlineSecurity,InternetService,MultipleLines,PhoneService,TechSupport,StreamingMovies,StreamingTV,datetime
0,7590-VHVEG,No,Yes,No,DSL,No phone service,No,No,No,No,2021-10-25 15:07:18.625390512
1,5575-GNVDE,Yes,No,Yes,DSL,No,Yes,No,No,No,2020-06-28 06:32:24.674808292
2,3668-QPYBK,No,Yes,Yes,DSL,No,Yes,No,No,No,2021-12-05 20:10:58.449304176


---
## <span style="color:#ff5f27;"> 🛠️ Feature Engineering </span>

In this section you will perform feature engineering, such as converting textual features to numerical ones and replacing missing values with 0. Let's start with the customer information feature group.

In [None]:
# Convert the "TotalCharges" column to numeric, treating errors as NaN
customer_info_df["TotalCharges"] = pd.to_numeric(
    customer_info_df["TotalCharges"],
    errors='coerce',
)

# Replace NaN values in the "TotalCharges" column with 0
# (assign the result back instead of using inplace=True on a column selection)
customer_info_df["TotalCharges"] = customer_info_df["TotalCharges"].fillna(0)

# Replace values in the "Churn" column with 0 for "No" and 1 for "Yes"
customer_info_df["Churn"] = customer_info_df["Churn"].replace({"No": 0, "Yes": 1})
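
The effect of `errors='coerce'` followed by `fillna(0)` is easiest to see on a toy column. The sketch below uses a made-up three-value Series (not the real data) in which one entry is a blank string that cannot be parsed as a number.

```python
import pandas as pd

# Toy column: the middle value is a blank string, not a number
raw = pd.Series(["29.85", " ", "1889.5"], name="TotalCharges")

# Unparseable entries become NaN instead of raising an error
numeric = pd.to_numeric(raw, errors="coerce")

# NaN entries are then replaced with 0
filled = numeric.fillna(0)

print(numeric.isna().sum(), "value(s) could not be parsed")
print(filled.tolist())
```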

---
## <span style="color:#ff5f27;"> 🪄 Creating Feature Groups </span>

A [feature group](https://docs.hopsworks.ai/feature-store-api/latest/generated/feature_group/) can be seen as a collection of conceptually related features. In this case, you will create 3 feature groups:
1. Customer information
2. Customer demography
3. Customer subscription

As you can see, the feature groups correspond to their source data. All three use the same column as their primary key, which will let you join them when creating a dataset in the next tutorial.

Before you can create a feature group, you need to connect to the Hopsworks feature store.

In [13]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store()

Copy your Api Key (first register/login): https://c.app.hopsworks.ai/account/api/generated

Paste it here: ··········

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/1193142


To create a feature group, you need to give it a name and specify a primary key. It is also good practice to provide a description of its contents.

In [16]:
# Get or create the 'customer_info' feature group
customer_info_fg = fs.get_or_create_feature_group(
    name="customer_info",
    version=1,
    description="Customer info for churn prediction.",
    primary_key=['customerID'],
    event_time="datetime",
)
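
Since `customerID` is declared as the primary key, it can be worth checking that the column behaves like one before inserting data. The helper below is a hypothetical, pandas-only sketch (`check_primary_key` is not part of the Hopsworks API); it simply counts null and duplicated key values on a made-up sample.

```python
import pandas as pd


def check_primary_key(df: pd.DataFrame, key: str) -> dict:
    """Return simple counts describing how well `key` works as a primary key."""
    return {
        "rows": len(df),
        "null_keys": int(df[key].isna().sum()),
        "duplicate_keys": int(df[key].duplicated().sum()),
    }


# Made-up sample with one repeated id and one missing id
sample = pd.DataFrame(
    {"customerID": ["7590-VHVEG", "5575-GNVDE", "5575-GNVDE", None]}
)

report = check_primary_key(sample, "customerID")
print(report)
```

On the real data you could call `check_primary_key(customer_info_df, "customerID")` and expect both counts to be zero if the column is a clean key.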

In [20]:
# Show when the feature group was created
customer_info_fg.created

'2024-12-04T14:56:32.000Z'

A full list of arguments can be found in the [documentation](https://docs.hopsworks.ai/feature-store-api/latest/generated/api/feature_store_api/#create_feature_group).

At this point, you have only specified some metadata for the feature group. It does not store any data or even have a schema defined for the data. To make the feature group persistent you need to populate it with its associated data using the `insert` function.

In [10]:
# Insert data into feature group
customer_info_fg.insert(customer_info_df)

Feature Group created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/1193142/fs/1182819/fg/1377752


Uploading Dataframe: 100.00% |██████████| Rows 7043/7043 | Elapsed Time: 00:01 | Remaining Time: 00:00


Launching job: customer_info_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai:443/p/1193142/jobs/named/customer_info_1_offline_fg_materialization/executions


(Job('customer_info_1_offline_fg_materialization', 'SPARK'), None)

In [11]:
# Update feature descriptions
feature_descriptions = [
    {"name": "customerid", "description": "Customer id"},
    {"name": "contract", "description": "Type of contract"},
    {"name": "tenure", "description": "How long they’ve been a customer"},
    {"name": "paymentmethod", "description": "Payment method"},
    {"name": "paperlessbilling", "description": "Whether customer has paperless billing or not"},
    {"name": "monthlycharges", "description": "Monthly charges"},
    {"name": "totalcharges", "description": "Total charges"},
    {"name": "churn", "description": "Whether customer has left within the last month or not"},
    {"name": "datetime", "description": "Date when the customer information was recorded"},
]

for desc in feature_descriptions:
    customer_info_fg.update_feature_description(desc["name"], desc["description"])
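
Note that the description list above uses lowercase names (`customerid`, `contract`, ...) while the DataFrame columns are mixed case, which suggests that feature names are stored in lowercase. Under that assumption, a small pandas-free check like the one below can catch typos in a description list before it is sent to the feature store. The function name `compare_descriptions` and the deliberately misspelled entry are illustrative only.

```python
def compare_descriptions(columns, descriptions):
    """Compare lowercased column names with the names used in a description list.

    Returns (undescribed, unknown): columns without a description, and
    described names that match no column.
    """
    lowered = [c.lower() for c in columns]
    described = [d["name"] for d in descriptions]
    undescribed = [c for c in lowered if c not in described]
    unknown = [n for n in described if n not in lowered]
    return undescribed, unknown


# Column names as they appear in the customer_info preview
columns = ["customerID", "Contract", "tenure"]

# Description list with a deliberate typo ("contact" instead of "contract")
descriptions = [
    {"name": "customerid", "description": "Customer id"},
    {"name": "contact", "description": "Type of contract"},
    {"name": "tenure", "description": "How long they've been a customer"},
]

undescribed, unknown = compare_descriptions(columns, descriptions)
print("Columns without a description:", undescribed)
print("Described names with no matching column:", unknown)
```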

In [12]:
# Get or create the 'customer_demography_info' feature group
demography_fg = fs.get_or_create_feature_group(
    name="customer_demography_info",
    version=1,
    description="Customer demography info for churn prediction.",
    primary_key=['customerID'],
)
# Insert data into feature group
demography_fg.insert(demography_df)

Feature Group created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/1193142/fs/1182819/fg/1377753


Uploading Dataframe: 100.00% |██████████| Rows 7043/7043 | Elapsed Time: 00:01 | Remaining Time: 00:00


Launching job: customer_demography_info_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai:443/p/1193142/jobs/named/customer_demography_info_1_offline_fg_materialization/executions


(Job('customer_demography_info_1_offline_fg_materialization', 'SPARK'), None)

In [15]:
type(demography_fg)

In [16]:
# Update feature descriptions
feature_descriptions = [
    {"name": "customerid", "description": "Customer id"},
    {"name": "gender", "description": "Customer gender"},
    {"name": "seniorcitizen", "description": "Whether customer is a senior citizen or not"},
    {"name": "dependents", "description": "Whether customer has dependents or not"},
    {"name": "partner", "description": "Whether customer has a partner or not"},
]

for desc in feature_descriptions:
    demography_fg.update_feature_description(desc["name"], desc["description"])

In [17]:
# Get or create the 'customer_subscription_info' feature group
subscriptions_fg = fs.get_or_create_feature_group(
    name="customer_subscription_info",
    version=1,
    description="Customer subscription info for churn prediction.",
    primary_key=['customerID'],
    event_time="datetime",
)
# Insert data into feature group
subscriptions_fg.insert(subscriptions_df)

Feature Group created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/1193142/fs/1182819/fg/1377754


Uploading Dataframe: 100.00% |██████████| Rows 7043/7043 | Elapsed Time: 00:01 | Remaining Time: 00:00


Launching job: customer_subscription_info_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai:443/p/1193142/jobs/named/customer_subscription_info_1_offline_fg_materialization/executions


(Job('customer_subscription_info_1_offline_fg_materialization', 'SPARK'), None)

In [18]:
# Update feature descriptions
feature_descriptions = [
    {"name": "customerid", "description": "Customer id"},
    {"name": "deviceprotection", "description": "Whether customer has signed up for device protection service"},
    {"name": "onlinebackup", "description": "Whether customer has signed up for online backup service"},
    {"name": "onlinesecurity", "description": "Whether customer has signed up for online security service"},
    {"name": "internetservice", "description": "Type of internet service the customer has signed up for (e.g. DSL)"},
    {"name": "multiplelines", "description": "Whether customer has signed up for multiple lines service"},
    {"name": "phoneservice", "description": "Whether customer has signed up for phone service"},
    {"name": "techsupport", "description": "Whether customer has signed up for tech support service"},
    {"name": "streamingmovies", "description": "Whether customer has signed up for streaming movies service"},
    {"name": "streamingtv", "description": "Whether customer has signed up for streaming TV service"},
    {"name": "datetime", "description": "Date when the customer information was recorded"},
]

for desc in feature_descriptions:
    subscriptions_fg.update_feature_description(desc["name"], desc["description"])

All three feature groups are now accessible and searchable in the UI.

![fg-overview](https://github.com/logicalclocks/hopsworks-tutorials/blob/master/churn/images/churn_fg.gif?raw=1)

---
## <span style="color:#ff5f27;">⏭️ **Next:** Part 02 </span>

In the following notebook you will use your feature groups to create a training dataset, train a model, and add the trained model to the model registry.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/churn/2_churn_training_pipeline.ipynb)