
# Data Processing and Analysis
Data processing is one of the most important and most time-consuming stages in the lifecycle of a machine learning project.

In this notebook, we analyze a dummy dataset to understand the issues that come up with real-world datasets and the steps to handle them.

# Utilities
We add some utility functions here that are used throughout this notebook. They are also packaged in a utils.py file that you can use offline. Since we use Colab for the tutorials, all the functions are included in the notebook itself to avoid the hassle of file uploads and Drive connections.

In [0]:
import datetime
import random
from random import randrange
import numpy as np
import pandas as pd


def _random_date(start,date_count):
    """This generator yields random dates based on params
    Args:
        start (date object): the base date
        date_count (int): number of dates to be generated
    Yields:
        random dates, each between 0 and 41 days after the base date

    """
    while date_count > 0:
        # randrange(42) picks an offset between 0 and 41 days
        yield start + datetime.timedelta(days=randrange(42))
        date_count-=1
        
        

def generate_sample_data(row_count=100):
    """This function generates a random transaction dataset
    Args:
        row_count (int): number of rows for the dataframe
    Returns:
        a pandas dataframe

    """

    # sentinels
    startDate = datetime.datetime(2016, 1, 1, 13)
    serial_number_sentinel = 1000
    user_id_sentinel = 5001
    product_id_sentinel = 101
    price_sentinel = 2000

    # base list of attributes
    data_dict = {
        'Serial No':
        np.arange(row_count) + serial_number_sentinel,
        'Date':
        np.random.permutation(
            pd.to_datetime([
                x.strftime("%d-%m-%Y")
                for x in _random_date(startDate, row_count)
            ]).date),
        'User ID':
        np.random.permutation(
            np.random.randint(0, row_count, size=int(row_count / 10)) +
            user_id_sentinel).tolist() * 10,
        'Product ID':
        np.random.permutation(
            np.random.randint(0, row_count, size=int(row_count / 10)) +
            product_id_sentinel).tolist() * 10,
        'Quantity Purchased':
        np.random.permutation(np.random.randint(1, 42, size=row_count)),
        'Price':
        np.round(
            np.abs(np.random.randn(row_count) + 1) * price_sentinel,
            decimals=2),
        'User Type':
        np.random.permutation(
            [chr(random.randrange(97, 97 + 3 + 1)) for i in range(row_count)])
    }

    # introduce missing values
    for index in range(int(np.sqrt(row_count))):
        data_dict['Price'][np.argmax(
            data_dict['Price'] == random.choice(data_dict['Price']))] = np.nan
        data_dict['User Type'][np.argmax(
            data_dict['User Type'] == random.choice(
                data_dict['User Type']))] = np.nan
        data_dict['Date'][np.argmax(
            data_dict['Date'] == random.choice(data_dict['Date']))] = np.nan
        data_dict['Product ID'][np.argmax(data_dict['Product ID'] == random.
                                          choice(data_dict['Product ID']))] = 0
        data_dict['Serial No'][np.argmax(data_dict['Serial No'] == random.
                                         choice(data_dict['Serial No']))] = -1
        data_dict['User ID'][np.argmax(data_dict['User ID'] == random.choice(
            data_dict['User ID']))] = -101

    # create data frame
    df = pd.DataFrame(data_dict)

    return df

In [0]:
import numpy as np
import pandas as pd
from IPython.display import display
from sklearn import preprocessing

pd.options.mode.chained_assignment = None

#### Generate Dataset
Question: Generate 1000 sample rows

In [5]:
## Generate a dataset with 1000 rows
df = generate_sample_data(row_count=1000)
df.shape

(1000, 7)

Analyze generated Dataset

In [6]:
df.head()

index,Serial No,Date,User ID,Product ID,Quantity Purchased,Price,User Type
0,1000,,-101,0,23,3269.85,n
1,1001,,5243,209,35,2014.99,n
2,1002,2016-01-15,5622,1017,19,5008.5,n
3,1003,2016-01-29,5089,743,28,,n
4,1004,,5179,613,14,377.25,n


### Dataframe Stats
#### Determine the following:

- The number of data points (rows). (Hint: check out the dataframe .shape attribute.)
- The column names. (Hint: check out the dataframe .columns attribute.)
- The data types for each column. (Hint: check out the dataframe .dtypes attribute.)


In [7]:
print("Number of rows::",df.shape[0])

Number of rows:: 1000


Question
- Get the number of columns

In [8]:
print("Number of columns::",df.shape[1])

Number of columns:: 7


In [9]:
print("Column Names::",df.columns.values.tolist())

Column Names:: ['Serial No', 'Date', 'User ID', 'Product ID', 'Quantity Purchased', 'Price', 'User Type']


In [10]:
print("Column Data Types::\n",df.dtypes)

Column Data Types::
 Serial No               int64
Date                   object
User ID                 int64
Product ID              int64
Quantity Purchased      int64
Price                 float64
User Type              object
dtype: object


In [11]:
print("Columns with Missing Values::",df.columns[df.isnull().any()].tolist())

Columns with Missing Values:: ['Date', 'Price']
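
Beyond listing which columns have gaps, it often helps to count them per column and per row. The following is a minimal sketch on a small made-up frame; `toy` and its values are illustrative assumptions, not data from the generated dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are made up for illustration.
toy = pd.DataFrame({
    "date": ["2016-01-15", None, "2016-01-29"],
    "price": [3269.85, np.nan, np.nan],
    "qty": [23, 35, 19],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()

# Number of rows that contain at least one missing value.
rows_with_missing = int(toy.isnull().any(axis=1).sum())

print(missing_per_column)
print("Rows with missing values:", rows_with_missing)
```

The same two expressions can be applied to `df` unchanged.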


In [12]:
print("Number of rows with Missing Values::",df.isnull().any(axis=1).sum())


Number of rows with Missing Values:: 62


In [13]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
Serial No             1000 non-null int64
Date                  969 non-null object
User ID               1000 non-null int64
Product ID            1000 non-null int64
Quantity Purchased    1000 non-null int64
Price                 969 non-null float64
User Type             1000 non-null object
dtypes: float64(1), int64(4), object(2)
memory usage: 54.8+ KB
None


In [14]:
print(df.describe())

         Serial No      User ID   Product ID  Quantity Purchased        Price
count  1000.000000  1000.000000  1000.000000         1000.000000   969.000000
mean   1450.945000  5497.766000   596.272000           20.039000  2304.632817
std     384.168376   330.554757   283.027459           11.787554  1560.091313
min      -1.000000  -101.000000     0.000000            1.000000     0.810000
25%    1224.750000  5243.000000   361.000000            9.000000  1111.360000
50%    1480.500000  5509.000000   616.000000           19.000000  2104.250000
75%    1736.250000  5758.750000   844.750000           30.000000  3225.970000
max    1999.000000  5992.000000  1093.000000           41.000000  8429.130000


#### Standardize Columns
Question
- Use columns attribute and tolist() method to get the list of all columns

In [15]:
# list all columns
print("Dataframe columns:\n{}".format(df.columns.tolist()))

Dataframe columns:
['Serial No', 'Date', 'User ID', 'Product ID', 'Quantity Purchased', 'Price', 'User Type']


### Utility to Standardize Columns
- Question: Column names in Python are usually written in lowercase snake_case. Write a utility method that converts them. You may use string methods such as lower and replace. Setting inplace=True avoids creating a copy of your dataframe.

In [0]:
def cleanup_column_names(df,rename_dict=None,do_inplace=True):
    """This function renames columns of a pandas dataframe
       It converts column names to snake case if rename_dict is not passed. 
    Args:
        df (pandas dataframe): dataframe whose columns are to be renamed
        rename_dict (dict): keys represent old column names and values point to 
                            newer ones
        do_inplace (bool): flag to update existing dataframe or return a new one
    Returns:
        pandas dataframe if do_inplace is set to False, None otherwise

    """
    if not rename_dict:
        # lower case and replace <space> with <underscore>
        rename_dict = {col: col.lower().replace(' ','_') 
                       for col in df.columns.values.tolist()}
    # do_inplace is honoured for both the default and the custom mapping
    return df.rename(columns=rename_dict,inplace=do_inplace)

In [0]:
cleanup_column_names(df)

In [18]:
# Updated column names
print("Dataframe columns:\n{}".format(df.columns.tolist()))

Dataframe columns:
['serial_no', 'date', 'user_id', 'product_id', 'quantity_purchased', 'price', 'user_type']


### Basic Manipulation

#### Sort by specific attributes
- Question: Sort serial_no in ascending and price in descending order.

In [19]:
# Ascending for Serial No and Descending for Price
display(df.sort_values(['serial_no', 'price'], 
                         ascending=[True, False]).head())

index,serial_no,date,user_id,product_id,quantity_purchased,price,user_type
895,-1,2016-01-27,5372,126,15,5902.34,d
67,-1,,5633,872,32,5451.02,d
641,-1,2016-01-17,5958,716,18,4035.13,b
53,-1,2016-01-27,5417,393,30,4024.55,b
865,-1,2016-10-02,5961,458,14,3685.11,b


**Reorder columns**

In [20]:
display(df[['serial_no','date','user_id','user_type',
              'product_id','quantity_purchased','price']].head())

index,serial_no,date,user_id,user_type,product_id,quantity_purchased,price
0,1000,,-101,n,0,23,3269.85
1,1001,,5243,n,209,35,2014.99
2,1002,2016-01-15,5622,n,1017,19,5008.5
3,1003,2016-01-29,5089,n,743,28,
4,1004,,5179,n,613,14,377.25


**Select attributes**

In [21]:
# Using Column Index
# print 10 values from column at index 3
print(df.iloc[:,3].values[0:10])

[   0  209 1017  743  613  586  131  989  185  735]


In [22]:
# Using Column Name
# print 10 values of quantity purchased
print(df.quantity_purchased.values[0:10])

[23 35 19 28 14  9 31 23 14 31]


In [23]:
# Using Datatype
# print 10 values of columns with data type float
print(df.select_dtypes(include=['float64']).values[:10,0])

[3269.85 2014.99 5008.5      nan  377.25  978.17  473.23 2394.04 1133.4
  114.32]


**Select rows**

In [24]:
# Using Row Index
display(df.iloc[[10,501,20]])

index,serial_no,date,user_id,product_id,quantity_purchased,price,user_type
10,-1,2016-01-23,5171,361,6,1745.66,n
501,1501,2016-01-15,5243,209,7,389.52,a
20,1020,,5423,695,7,567.83,a


In [25]:
# Exclude specific rows
display(df.drop([0,24,51], axis=0).head())

index,serial_no,date,user_id,product_id,quantity_purchased,price,user_type
1,1001,,5243,209,35,2014.99,n
2,1002,2016-01-15,5622,1017,19,5008.5,n
3,1003,2016-01-29,5089,743,28,,n
4,1004,,5179,613,14,377.25,n
5,1005,2016-01-31,5507,586,9,978.17,n


**Question**
- Show only rows which have quantity purchased greater than 25

In [26]:
# Conditional Filtering
# Quantity_Purchased greater than 25
display(df[df.quantity_purchased > 25].head())

index,serial_no,date,user_id,product_id,quantity_purchased,price,user_type
1,1001,,5243,209,35,2014.99,n
3,1003,2016-01-29,5089,743,28,,n
6,1006,,5538,131,31,473.23,n
9,1009,2016-03-01,5839,735,31,114.32,n
11,1011,2016-01-23,5772,1017,39,1848.22,n
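
Several conditions can be combined into one boolean mask. The sketch below uses a made-up frame (`toy` and the price threshold of 3000 are illustrative assumptions) to show the pattern; each condition needs its own parentheses because `&` and `|` bind more tightly than comparisons.

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration.
toy = pd.DataFrame({
    "quantity_purchased": [23, 35, 28, 14, 31],
    "price": [3269.85, 2014.99, 5008.50, 377.25, 473.23],
})

# Rows with quantity above 25 AND price below 3000.
mask = (toy["quantity_purchased"] > 25) & (toy["price"] < 3000)
filtered = toy[mask]

print(filtered)
```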


In [27]:
# Offset from Top
display(df[100:].head())

index,serial_no,date,user_id,product_id,quantity_purchased,price,user_type
100,1100,2016-07-02,5553,748,6,1988.92,d
101,1101,2016-05-02,5243,209,2,1078.91,b
102,1102,2016-01-20,5622,1017,1,2276.24,d
103,1103,2016-06-01,5089,743,13,3876.4,c
104,1104,2016-01-02,5179,613,39,1830.42,d


In [28]:
# Offset from Bottom
display(df[-10:].head())

index,serial_no,date,user_id,product_id,quantity_purchased,price,user_type
990,-1,2016-01-17,5372,951,18,1353.36,d
991,1991,2016-01-14,5279,858,3,1715.95,b
992,1992,2016-01-24,5008,539,39,1403.04,d
993,1993,2016-02-02,5010,909,34,3116.13,b
994,1994,2016-07-02,5795,286,40,2052.13,b


**Type Casting**

In [29]:
# Existing Datatypes
df.dtypes

serial_no               int64
date                   object
user_id                 int64
product_id              int64
quantity_purchased      int64
price                 float64
user_type              object
dtype: object

In [30]:
# Set datetime as dtype for date column
df['date'] = pd.to_datetime(df.date)
print(df.dtypes)

serial_no                      int64
date                  datetime64[ns]
user_id                        int64
product_id                     int64
quantity_purchased             int64
price                        float64
user_type                     object
dtype: object
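
When dates arrive as strings, day and month can be ambiguous ("02-10-2016" could be 2 October or 10 February). The sketch below, on made-up strings, passes an explicit `format` so the parse does not depend on guessing. The `%d-%m-%Y` pattern matches the one used in `generate_sample_data`; the `raw` series itself is an illustrative assumption.

```python
import pandas as pd

# Hypothetical day-month-year strings.
raw = pd.Series(["02-10-2016", "15-01-2016"])

# An explicit format removes the ambiguity between day and month.
parsed = pd.to_datetime(raw, format="%d-%m-%Y")

print(parsed)
```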


**Map/Apply Functionality**

**Question**
- Write a utility method to create a new column user_class from user_type using the following mapping:
- user_type a and b map to user_class new
- user_type c maps to user_class existing
- user_type d maps to user_class loyal_existing
- map all other user_type values as error

In [0]:
def expand_user_type(u_type):
    """This function maps user types to user classes
    Args:
        u_type (str): user type value
    Returns:
        (str) user_class value

    """
    if u_type in ['a','b']:
        return 'new'
    elif u_type == 'c':
        return 'existing'
    elif u_type == 'd':
        return 'loyal_existing'
    else:
        return 'error'

In [32]:
# Map User Type to User Class
df['user_class'] = df['user_type'].map(expand_user_type)
display(df.tail())

index,serial_no,date,user_id,product_id,quantity_purchased,price,user_type,user_class
995,1995,2016-01-24,5372,126,4,3327.28,a,new
996,1996,2016-01-19,5217,491,16,2438.89,d,loyal_existing
997,1997,2016-11-01,5565,583,21,310.05,a,new
998,1998,2016-01-13,5721,844,34,991.62,d,loyal_existing
999,1999,2016-07-02,5224,859,28,3156.08,a,new
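
The same mapping can also be written as a dictionary lookup instead of a function. This is a sketch of an alternative, not the approach used in this notebook; `class_map` and the sample series are hypothetical names introduced for illustration.

```python
import pandas as pd

# Hypothetical sample of user types, including an unexpected value 'n'.
user_type = pd.Series(["a", "b", "c", "d", "n"])

class_map = {"a": "new", "b": "new", "c": "existing", "d": "loyal_existing"}

# Keys missing from the dictionary map to NaN, which fillna turns into 'error'.
user_class = user_type.map(class_map).fillna("error")

print(user_class.tolist())
```

A dictionary keeps the mapping in one place as data, while a function is easier to extend when the rules need more logic than a lookup.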


Question
- Get range for each numeric attribute, i.e. max-min

In [33]:
display(df.select_dtypes(include=[np.number]).apply(lambda x: 
                                                        x.max()- x.min()))

serial_no             2000.00
user_id               6093.00
product_id            1093.00
quantity_purchased      40.00
price                 8428.32
dtype: float64

In [0]:
# Map: Extract Week from Date (0 for missing dates)
df['purchase_week'] = df['date'].map(lambda dt: dt.week 
                                     if not pd.isnull(dt) 
                                     else 0)

In [35]:
display(df.head())

index,serial_no,date,user_id,product_id,quantity_purchased,price,user_type,user_class,purchase_week
0,1000,NaT,-101,0,23,3269.85,n,error,0
1,1001,NaT,5243,209,35,2014.99,n,error,0
2,1002,2016-01-15,5622,1017,19,5008.5,n,error,2
3,1003,2016-01-29,5089,743,28,,n,error,4
4,1004,NaT,5179,613,14,377.25,n,error,0


### Handling missing values

In [36]:
# Drop Rows with Missing Dates
df_dropped = df.dropna(subset=['date'])
display(df_dropped.head())

index,serial_no,date,user_id,product_id,quantity_purchased,price,user_type,user_class,purchase_week
2,1002,2016-01-15,5622,1017,19,5008.5,n,error,2
3,1003,2016-01-29,5089,743,28,,n,error,4
5,1005,2016-01-31,5507,586,9,978.17,n,error,4
7,1007,2016-01-31,5992,989,23,2394.04,n,error,4
9,1009,2016-03-01,5839,735,31,114.32,n,error,9


In [0]:
# Filling missing price with mean price
# (assign the result back instead of calling fillna in place on a column selection)
df_dropped['price'] = df_dropped['price'].fillna(
    value=np.round(df.price.mean(),decimals=2))

In [0]:
# Fill missing user types using values from previous row
df_dropped['user_type'] = df_dropped['user_type'].ffill()
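
The two fill strategies above can be seen side by side on a tiny frame. This is a minimal sketch with made-up values (`toy` is an illustrative assumption): the numeric column is filled with its rounded mean, and the categorical column with the previous row's value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one gap in each column.
toy = pd.DataFrame({
    "price": [100.0, np.nan, 300.0],
    "user_type": ["a", None, "c"],
})

# Mean imputation: the mean ignores NaN, so it is (100 + 300) / 2.
toy["price"] = toy["price"].fillna(round(toy["price"].mean(), 2))

# Forward fill: copy the last seen value into the gap.
toy["user_type"] = toy["user_type"].ffill()

print(toy)
```

Forward fill only makes sense when row order carries meaning; on shuffled data it amounts to picking an arbitrary neighbour's value.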

### Handle Duplicates
Question
- Identify duplicates only for column serial_no

In [39]:
# sample duplicates. Identify for serial_no
display(df_dropped[df_dropped.duplicated(subset=['serial_no'])].head())
print("Shape of df={}".format(df_dropped.shape))

index,serial_no,date,user_id,product_id,quantity_purchased,price,user_type,user_class,purchase_week
23,-1,2016-01-02,5464,822,33,3273.87,n,error,53
53,-1,2016-01-27,5417,393,30,4024.55,b,new,4
115,-1,2016-01-27,5501,628,25,2860.28,d,loyal_existing,4
145,-1,2016-06-01,5418,437,4,3137.01,b,new,22
239,-1,2016-01-02,5017,390,10,594.48,d,loyal_existing,53


Shape of df=(969, 9)


In [40]:
# Drop Duplicates
df_dropped.drop_duplicates(subset=['serial_no'],inplace=True)
display(df_dropped.head())
print("Shape of df={}".format(df_dropped.shape))

index,serial_no,date,user_id,product_id,quantity_purchased,price,user_type,user_class,purchase_week
2,1002,2016-01-15,5622,1017,19,5008.5,n,error,2
3,1003,2016-01-29,5089,743,28,2304.63,n,error,4
5,1005,2016-01-31,5507,586,9,978.17,n,error,4
7,1007,2016-01-31,5992,989,23,2394.04,n,error,4
9,1009,2016-03-01,5839,735,31,114.32,n,error,9


Shape of df=(940, 9)
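
By default, `drop_duplicates` keeps the first row of each duplicate group. The sketch below, on a made-up frame, contrasts that default with `keep=False`, which drops every row that shares a value. The latter may be worth considering when the duplicated value is a sentinel such as -1 rather than a real serial number; that is a suggestion, not something this notebook does.

```python
import pandas as pd

# Hypothetical toy frame; -1 stands in for an invalid serial number.
toy = pd.DataFrame({
    "serial_no": [-1, 1001, -1, 1003],
    "price": [10.0, 20.0, 30.0, 40.0],
})

# Default keep='first': one of the -1 rows survives.
first_kept = toy.drop_duplicates(subset=["serial_no"])

# keep=False: all rows with a repeated serial_no are removed.
none_kept = toy.drop_duplicates(subset=["serial_no"], keep=False)

print(first_kept)
print(none_kept)
```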


**Question**
- Remove rows which have fewer than 3 attributes with non-missing data
- Print the shape of dataframe thus prepared

In [41]:
# Remove rows which have fewer than 3 attributes with non-missing data
display(df.dropna(thresh=3).head())
print("Shape of df={}".format(df.dropna(thresh=3).shape))

index,serial_no,date,user_id,product_id,quantity_purchased,price,user_type,user_class,purchase_week
0,1000,NaT,-101,0,23,3269.85,n,error,0
1,1001,NaT,5243,209,35,2014.99,n,error,0
2,1002,2016-01-15,5622,1017,19,5008.5,n,error,2
3,1003,2016-01-29,5089,743,28,,n,error,4
4,1004,NaT,5179,613,14,377.25,n,error,0


Shape of df=(1000, 9)
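
No rows are removed above because every row has at least 3 non-missing values. To see `thresh` take effect, here is a minimal sketch on a made-up frame (`toy` is an illustrative assumption) where rows have 4, 3 and 1 non-missing values respectively.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 0 has 4 values, row 1 has 3, row 2 has 1.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
    "c": [3.0, 6.0, np.nan],
    "d": [4.0, 7.0, 9.0],
})

# thresh=3 keeps rows with at least 3 non-missing values.
kept = toy.dropna(thresh=3)

print(kept)
```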


### Handle Categoricals
**One Hot Encoding**

In [42]:
display(pd.get_dummies(df,columns=['user_type']).head())

index,serial_no,date,user_id,product_id,quantity_purchased,price,user_class,purchase_week,user_type_a,user_type_b,user_type_c,user_type_d,user_type_n
0,1000,NaT,-101,0,23,3269.85,error,0,0,0,0,0,1
1,1001,NaT,5243,209,35,2014.99,error,0,0,0,0,0,1
2,1002,2016-01-15,5622,1017,19,5008.5,error,2,0,0,0,0,1
3,1003,2016-01-29,5089,743,28,,error,4,0,0,0,0,1
4,1004,NaT,5179,613,14,377.25,error,0,0,0,0,0,1
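
Depending on the pandas version, `get_dummies` may return boolean columns instead of the 0/1 integers shown above. A small sketch on a made-up frame (`toy` is an illustrative assumption) that requests integer output explicitly through the `dtype` parameter:

```python
import pandas as pd

# Hypothetical toy frame with one categorical column.
toy = pd.DataFrame({
    "user_type": ["a", "b", "a"],
    "price": [1.0, 2.0, 3.0],
})

# One indicator column per category, stored as integers.
encoded = pd.get_dummies(toy, columns=["user_type"], dtype=int)

print(encoded)
```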


#### Label Encoding
Question
- Use a dictionary to encode user_type values as a sequence of numbers. Replace missing values (NaN) with -1

In [43]:
type_map = {'a': 0, 'b': 1, 'c': 2, 'd': 3, np.nan: -1}
df['encoded_user_type'] = df.user_type.map(type_map)
display((df.tail()))

index,serial_no,date,user_id,product_id,quantity_purchased,price,user_type,user_class,purchase_week,encoded_user_type
995,1995,2016-01-24,5372,126,4,3327.28,a,new,3,0.0
996,1996,2016-01-19,5217,491,16,2438.89,d,loyal_existing,3,3.0
997,1997,2016-11-01,5565,583,21,310.05,a,new,44,0.0
998,1998,2016-01-13,5721,844,34,991.62,d,loyal_existing,2,3.0
999,1999,2016-07-02,5224,859,28,3156.08,a,new,26,0.0
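
As an alternative to a hand-written dictionary, pandas categoricals assign integer codes automatically and use -1 for missing values. This is a sketch of that alternative on a made-up series, not the method used above; note that the codes follow the sorted order of the categories.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value.
user_type = pd.Series(["a", "d", np.nan, "b", "c"])

# Categories are sorted (a, b, c, d), so a -> 0, ..., d -> 3; NaN -> -1.
codes = user_type.astype("category").cat.codes

print(codes.tolist())
```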


### **Handle Numerical Attributes**
### **Min-Max Scaler**
**Question**
- Control the range of numerical attribute price by using MinMaxScaler transformer

In [0]:
df_normalized = df.dropna().copy()
min_max_scaler = preprocessing.MinMaxScaler()
np_scaled = min_max_scaler.fit_transform(df_normalized['price'].values.reshape(-1,1))
df_normalized['price'] = np_scaled.reshape(-1,1)
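
With its default settings, `MinMaxScaler` rescales a column to the range [0, 1] using (x - min) / (max - min). The sketch below applies that formula by hand with NumPy on three made-up prices, to make the transformation concrete.

```python
import numpy as np

# Hypothetical prices; the smallest maps to 0 and the largest to 1.
price = np.array([0.81, 2104.25, 8429.13])

scaled = (price - price.min()) / (price.max() - price.min())

print(scaled)
```

Because the minimum and maximum define the range, a single extreme value compresses all other values toward one end.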

**Robust Scaler**

In [0]:
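# Note: this starts again from df, so the min-max scaled price from the previous cell is not carried over
df_normalized = df.dropna().copy()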
robust_scaler = preprocessing.RobustScaler()
rs_scaled = robust_scaler.fit_transform(df_normalized['quantity_purchased'].values.reshape(-1,1))
df_normalized['quantity_purchased'] = rs_scaled.reshape(-1,1)

In [46]:
display(df_normalized.head())

index,serial_no,date,user_id,product_id,quantity_purchased,price,user_type,user_class,purchase_week,encoded_user_type
24,1024,2016-01-22,5180,320,-0.86747,692.9,d,loyal_existing,3,3.0
27,1027,2016-09-02,5724,405,-0.144578,2448.42,a,new,35,0.0
32,1032,2016-03-02,5210,871,0.048193,2052.48,d,loyal_existing,9,3.0
33,1033,2016-08-02,5959,594,1.060241,2834.89,d,loyal_existing,31,3.0
34,1034,2016-09-02,5327,821,-0.096386,4052.01,a,new,35,0.0
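
With its default settings, `RobustScaler` subtracts the median and divides by the interquartile range (75th minus 25th percentile), which is why the scaled quantities above are centred around 0. A minimal sketch of the same computation with NumPy, on made-up quantities:

```python
import numpy as np

# Hypothetical quantities.
qty = np.array([1.0, 9.0, 19.0, 30.0, 41.0])

median = np.median(qty)
q1, q3 = np.percentile(qty, [25, 75])

# Centre on the median and scale by the interquartile range.
scaled = (qty - median) / (q3 - q1)

print(scaled)
```

Since the median and quartiles are barely affected by a few extreme values, this scaling is less sensitive to outliers than min-max scaling.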


### **Group-By**
**Question**
- Group By attribute user_class and get sum of quantity_purchased
Hint: you may want to use Pandas groupby method to group by certain attributes before calculating the statistic.



In [47]:
# Group By attributes user_class and get sum of quantity_purchased
print(df.groupby(['user_class'])['quantity_purchased'].sum())

user_class
error              659
existing          4392
loyal_existing    5114
new               9874
Name: quantity_purchased, dtype: int64


In [48]:
# Aggregate functions: sum, mean and non-zero row count
display(
    df.groupby(['user_class'])['quantity_purchased'].agg(
        [np.sum, np.mean, np.count_nonzero]))

user_class,sum,mean,count_nonzero
error,659,21.258065,31
existing,4392,20.333333,216
loyal_existing,5114,19.444867,263
new,9874,20.15102,490


In [49]:
display(df.groupby(['user_class','user_type']).agg({'price':np.mean,
                                                        'quantity_purchased':np.max}))

user_class,user_type,price,quantity_purchased
error,n,1998.481667,39
existing,c,2345.008261,41
loyal_existing,d,2358.663398,41
new,a,2260.412788,41
new,b,2292.58768,41


In [50]:
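# Multiple Aggregate Functions
# (name, function) pairs rename the output columns, replacing the
# deprecated nested-dictionary syntax; note that 'variance_price'
# actually holds the standard deviation
display(
    df.groupby(['user_class', 'user_type']).agg({
        'price': [
            ('total_price', 'sum'),
            ('mean_price', 'mean'),
            ('variance_price', 'std'),
            ('count', np.count_nonzero)
        ],
        'quantity_purchased': ['sum']
    }))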



user_class,user_type,price.total_price,price.mean_price,price.variance_price,price.count,quantity_purchased.sum
error,n,59954.45,1998.481667,1542.409722,31.0,659
existing,c,485416.71,2345.008261,1443.811155,216.0,4392
loyal_existing,d,603817.83,2358.663398,1582.447298,263.0,5114
new,a,510853.29,2260.412788,1638.852527,235.0,4723
new,b,573146.92,2292.58768,1565.618614,255.0,5151
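
pandas also supports named aggregation, where each output column is declared as `name=(column, function)` and the result has flat column names. The following is a minimal sketch on a made-up frame; `toy` and the output names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration.
toy = pd.DataFrame({
    "user_class": ["new", "new", "existing"],
    "price": [100.0, 300.0, 50.0],
    "quantity_purchased": [2, 4, 6],
})

# Each keyword names an output column: (source column, aggregation).
summary = toy.groupby("user_class").agg(
    total_price=("price", "sum"),
    mean_price=("price", "mean"),
    total_quantity=("quantity_purchased", "sum"),
)

print(summary)
```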


### **Pivot Tables**

In [51]:
display(df.pivot_table(index='date', columns='user_type', 
                         values='price',aggfunc=np.mean))

date,a,b,c,d,n
2016-01-01,1779.835,2686.263333,1771.993333,2226.45,
2016-01-02,2728.5,2151.105455,1986.3225,2012.01,1791.805
2016-01-13,2186.888333,2647.95625,2059.875,1987.285455,
2016-01-14,1911.054286,2475.24375,5391.355,3427.183333,
2016-01-15,2808.4025,3193.993333,2835.05,1960.786667,3630.905
2016-01-16,1615.92625,2014.19,3236.77,4402.62,
2016-01-17,1744.336,2660.75,2372.905,2218.95,1152.31
2016-01-18,2801.45,1621.2425,1842.843333,3213.638,
2016-01-19,726.206667,1881.148,2078.983333,2617.224286,
2016-01-20,2672.77,1207.628,2479.979231,2515.611667,2353.225


**Stacking**

In [52]:
print(df.stack())

0    serial_no                1000
     user_id                  -101
     product_id                  0
     quantity_purchased         23
     price                 3269.85
                            ...   
999  price                 3156.08
     user_type                   a
     user_class                new
     purchase_week              26
     encoded_user_type           0
Length: 9907, dtype: object
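
Stacking moves the column labels into an inner index level, producing one long Series, and `unstack` reverses it. A minimal sketch on a made-up frame without missing values (`toy` is an illustrative assumption):

```python
import pandas as pd

# Hypothetical 2x2 frame.
toy = pd.DataFrame({"price": [10.0, 20.0], "qty": [1.0, 2.0]},
                   index=["r0", "r1"])

# Long format: one value per (row label, column label) pair.
stacked = toy.stack()

# Back to wide format.
restored = stacked.unstack()

print(stacked)
print(restored)
```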
