<a href="https://colab.research.google.com/github/thaanirs/amazon_ML_23/blob/master/amazon_ml.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Product length prediction

In this hackathon, the goal is to develop a machine learning model that can predict the length dimension of a product. Product length is crucial for packaging and storing products efficiently in the warehouse. Moreover, in many cases, it is an important attribute that customers use to assess the product size before purchasing. However, measuring the length of a product manually can be time-consuming and error-prone, especially for large catalogs with millions of products.

You will have access to the product title, description, bullet points, product type ID, and product length for 2.2 million products to train and test your submissions. Note that there is some noise in the data.

## Task

You are required to build a machine learning model that can predict product length from catalog metadata.

## Dataset description

The dataset folder contains the following files:

- train.csv: 2249698 x 6
- test.csv: 734736 x 5
- sample_submission.csv: 734736 x 2

The columns provided in the dataset are as follows:

| Column name | Description |
| ------------ | -----------|
|PRODUCT_ID|	Represents a unique identification of a product|
|TITLE	|Represents the title of the product|
|DESCRIPTION	|Represents the description of the product|
|BULLET_POINTS	|Represents the bullet points about the product|
|PRODUCT_TYPE_ID|	Represents the product type |
|PRODUCT_LENGTH	|Represents the length of the product|




In [3]:
!wget https://s3-ap-southeast-1.amazonaws.com/he-public-data/datasetb2d9982.zip 

--2023-04-23 17:09:45--  https://s3-ap-southeast-1.amazonaws.com/he-public-data/datasetb2d9982.zip
Resolving s3-ap-southeast-1.amazonaws.com (s3-ap-southeast-1.amazonaws.com)... 52.219.128.246
Connecting to s3-ap-southeast-1.amazonaws.com (s3-ap-southeast-1.amazonaws.com)|52.219.128.246|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 895569552 (854M) [binary/octet-stream]
Saving to: ‘datasetb2d9982.zip’


2023-04-23 17:11:42 (7.35 MB/s) - ‘datasetb2d9982.zip’ saved [895569552/895569552]



In [4]:
!unzip ./datasetb2d9982 

Archive:  ./datasetb2d9982.zip
   creating: dataset/
  inflating: dataset/sample_submission.csv  
  inflating: dataset/train.csv       
  inflating: dataset/test.csv        


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv('./dataset/train.csv')

In [None]:
df.head(20)

Unnamed: 0,PRODUCT_ID,TITLE,BULLET_POINTS,DESCRIPTION,PRODUCT_TYPE_ID,PRODUCT_LENGTH
0,1925202,ArtzFolio Tulip Flowers Blackout Curtain for D...,[LUXURIOUS & APPEALING: Beautiful custom-made ...,,1650,2125.98
1,2673191,Marks & Spencer Girls' Pyjama Sets T86_2561C_N...,"[Harry Potter Hedwig Pyjamas (6-16 Yrs),100% c...",,2755,393.7
2,2765088,PRIKNIK Horn Red Electric Air Horn Compressor ...,"[Loud Dual Tone Trumpet Horn, Compatible With ...","Specifications: Color: Red, Material: Aluminiu...",7537,748.031495
3,1594019,ALISHAH Women's Cotton Ankle Length Leggings C...,[Made By 95%cotton and 5% Lycra which gives yo...,AISHAH Women's Lycra Cotton Ankel Leggings. Br...,2996,787.401574
4,283658,The United Empire Loyalists: A Chronicle of th...,,,6112,598.424
5,2152929,HINS Metal Bucket Shape Plant Pot for Indoor &...,"[Simple and elegant, great for displaying indo...",HINS Brings you the most Elegant Looking Pot w...,5725,950.0
6,413758,Ungifted: My Life and Journey,,,23,598.0
7,2026580,Delavala Self Adhesive Kitchen Backsplash Wall...,[HIGH QUALITY PVC MATERIAL: The kitchen alumin...,<p><strong>Aluminum Foil Stickers-good kitchen...,6030,984.251967
8,2050239,PUMA Cali Sport Clean Women's Sneakers White L...,[Style Name:-Cali Sport Clean Women's Sneakers...,,3302,393.7
9,2998633,Hexwell Essential oil for Home Fragrance Oil A...,[100% Pure And Natural Essential Oil Or Fragra...,"Transform your home, workplace or hotel room i...",8201,393.700787


In [None]:
(df['BULLET_POINTS'].head(20))

0     [LUXURIOUS & APPEALING: Beautiful custom-made ...
1     [Harry Potter Hedwig Pyjamas (6-16 Yrs),100% c...
2     [Loud Dual Tone Trumpet Horn, Compatible With ...
3     [Made By 95%cotton and 5% Lycra which gives yo...
4                                                   NaN
5     [Simple and elegant, great for displaying indo...
6                                                   NaN
7     [HIGH QUALITY PVC MATERIAL: The kitchen alumin...
8     [Style Name:-Cali Sport Clean Women's Sneakers...
9     [100% Pure And Natural Essential Oil Or Fragra...
10    [Good quality and Suitable to use.,This Produc...
11                                                  NaN
12                                                  NaN
13                                                  NaN
14    [Segovia bottle consists of stainless steel wh...
15                                                  NaN
16                                                  NaN
17    [High Impact ABS Material Shell,Replaceabl

In [None]:
df.isnull().sum()

PRODUCT_ID               0
TITLE                   12
BULLET_POINTS       837364
DESCRIPTION        1157381
PRODUCT_TYPE_ID          0
PRODUCT_LENGTH           0
dtype: int64
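
Roughly 37% of rows have no bullet points and about 51% have no description (837,364 and 1,157,381 of 2,249,698), while only 12 titles are missing. One possible way to use all three text fields despite the gaps is to fill missing values with empty strings and concatenate them. A minimal sketch on a toy frame; `combine_text`, `TEXT_COLUMNS`, and the `TEXT` column are hypothetical names, not part of the original notebook.

```python
import pandas as pd

TEXT_COLUMNS = ["TITLE", "BULLET_POINTS", "DESCRIPTION"]

def combine_text(frame: pd.DataFrame) -> pd.Series:
    """Join the text columns into one string per row, treating NaN as empty."""
    filled = frame[TEXT_COLUMNS].fillna("")
    joined = filled.agg(" ".join, axis=1)
    # Collapse the extra spaces left behind by empty fields.
    return joined.str.split().str.join(" ")

# Toy rows that mimic the missing-value pattern seen above.
toy = pd.DataFrame({
    "TITLE": ["Wall Clock", "Some Book"],
    "BULLET_POINTS": ["[12 inch dial]", None],
    "DESCRIPTION": [None, None],
})
toy["TEXT"] = combine_text(toy)
print(toy["TEXT"].tolist())
```

Rows with only a title simply end up with the title as their combined text, so no row has to be dropped.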

In [None]:
len(df.PRODUCT_TYPE_ID.unique())

12907

In [None]:
len(df)

2249698

In [None]:
len(df) - len(df.TITLE.unique())

38935

In [None]:
df.columns

Index(['PRODUCT_ID', 'TITLE', 'BULLET_POINTS', 'DESCRIPTION',
       'PRODUCT_TYPE_ID', 'PRODUCT_LENGTH'],
      dtype='object')

In [None]:
df['TITLE'].head()

0    ArtzFolio Tulip Flowers Blackout Curtain for D...
1    Marks & Spencer Girls' Pyjama Sets T86_2561C_N...
2    PRIKNIK Horn Red Electric Air Horn Compressor ...
3    ALISHAH Women's Cotton Ankle Length Leggings C...
4    The United Empire Loyalists: A Chronicle of th...
Name: TITLE, dtype: object

In [None]:
test = pd.read_csv("./dataset/test.csv")
test.head()

Unnamed: 0,PRODUCT_ID,TITLE,BULLET_POINTS,DESCRIPTION,PRODUCT_TYPE_ID
0,604373,Manuel d'Héliogravure Et de Photogravure En Re...,,,6142
1,1729783,DCGARING Microfiber Throw Blanket Warm Fuzzy P...,[QUALITY GUARANTEED: Luxury cozy plush polyest...,<b>DCGARING Throw Blanket</b><br><br> <b>Size ...,1622
2,1871949,I-Match Auto Parts Front License Plate Bracket...,"[Front License Plate Bracket Made Of Plastic,D...",Replacement for The Following Vehicles:2020 LE...,7540
3,1107571,PinMart Gold Plated Excellence in Service 1 Ye...,[Available as a single item or bulk packed. Se...,Our Excellence in Service Lapel Pins feature a...,12442
4,624253,"Visual Mathematics, Illustrated by the TI-92 a...",,,6318


In [None]:
sam = pd.read_csv("./dataset/sample_submission.csv")
sam.head()

Unnamed: 0,PRODUCT_ID,PRODUCT_LENGTH
0,604373,701.093794
1,1729783,734.506163
2,1871949,741.360258
3,1107571,730.327767
4,624253,666.847946


## Preprocessing

In [None]:
df_dropped = df.drop(['BULLET_POINTS','DESCRIPTION','TITLE'],axis='columns')

In [None]:
df_dropped.head(20)

Unnamed: 0,PRODUCT_ID,PRODUCT_TYPE_ID,PRODUCT_LENGTH
0,1925202,1650,2125.98
1,2673191,2755,393.7
2,2765088,7537,748.031495
3,1594019,2996,787.401574
4,283658,6112,598.424
5,2152929,5725,950.0
6,413758,23,598.0
7,2026580,6030,984.251967
8,2050239,3302,393.7
9,2998633,8201,393.700787


In [None]:
# TITLE was dropped from df_dropped above, so read it from the original frame.
list(df.TITLE.unique())

In [None]:
df_dropped.dtypes

In [None]:
sorted(df_dropped.PRODUCT_TYPE_ID.unique())

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score


In [None]:
X,y = df_dropped.drop('PRODUCT_LENGTH',axis='columns'),df_dropped['PRODUCT_LENGTH']

In [None]:
len(X)

In [None]:
len(y)

In [None]:
X_train,X_test,y_train,y_test = train_test_split(X,y,shuffle=True,test_size=0.2)

In [None]:
linmod = LinearRegression()
linmod.fit(X_train,y_train)
# linmod.score(X_test,y_test)

In [None]:
linmod.score(X_test,y_test)

In [None]:
from sklearn import metrics
# scikit-learn metrics take (y_true, y_pred) in that order.
metrics.mean_squared_log_error(y_test, linmod.predict(X_test))
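
`mean_squared_log_error` rejects negative values, and an unconstrained linear regression can produce negative predictions. A minimal sketch, on made-up numbers, of clipping predictions at zero before scoring and of checking the result against the definition, the mean of (log(1 + y) - log(1 + ŷ))². The arrays below are illustrative only and are not results from this notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

y_true = np.array([393.7, 598.0, 2125.98])
y_pred = np.array([410.0, -25.0, 1900.0])   # made-up predictions, one negative

# Negative predictions are invalid for MSLE, so clip them to zero first.
y_pred_clipped = np.clip(y_pred, 0, None)

msle = mean_squared_log_error(y_true, y_pred_clipped)
# Same quantity computed directly from the definition.
manual = np.mean((np.log1p(y_true) - np.log1p(y_pred_clipped)) ** 2)
print(msle, manual)
```

Clipping is only one option; training on `log1p` of the target and inverting with `expm1` is another way to keep predictions non-negative.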

In [None]:
dt = DecisionTreeRegressor()
dt.fit(X_train,y_train)
# dt.score(X_test,y_test)

In [None]:
metrics.mean_squared_log_error(y_test, dt.predict(X_test))

In [None]:
rdt = RandomForestRegressor()
rdt.fit(X_train,y_train)
metrics.mean_squared_log_error(y_test, rdt.predict(X_test))

In [None]:
len(df_dropped['PRODUCT_LENGTH'].unique())

## Classification

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB,MultinomialNB

In [None]:
df.columns

In [None]:
df_dropped = df.drop(['PRODUCT_ID','TITLE','BULLET_POINTS','DESCRIPTION'],axis='columns')

In [None]:
df_dropped.head()

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Write the encoded labels into df_dropped, which is the frame used for the split below.
df_dropped['PRODUCT_LENGTH'] = le.fit_transform(df_dropped.PRODUCT_LENGTH)

In [None]:
X_train,X_test,y_train,y_test = train_test_split(df_dropped[['PRODUCT_TYPE_ID']],df_dropped.PRODUCT_LENGTH,test_size=0.3,shuffle=True,random_state=50)

In [None]:
X_train.head()

In [None]:
svm = SVC()
svm.fit(X_train,y_train)

# NLP

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [None]:
df = pd.read_csv('./dataset/train.csv')


In [None]:
df[df['BULLET_POINTS'].isnull() & df['DESCRIPTION'].isnull()].isnull().sum()

In [None]:
# df['TITLE'].fillna(' ')
# df.isnull().sum()
df[df['TITLE'].isna()]

In [None]:
df.dropna(axis=0,subset='TITLE',inplace=True)

In [None]:
df.isnull().sum()

In [None]:
vectorizer = CountVectorizer()
# Keep the result as a SciPy sparse matrix; a dense copy of millions of rows would not fit in memory.
X = vectorizer.fit_transform(df['TITLE'])

In [None]:

X_train, X_test, y_train, y_test = train_test_split(X, df['PRODUCT_LENGTH'], test_size=0.2, random_state=42)


model = LinearRegression()

model.fit(X_train, y_train)


y_pred = model.predict(X_test)

# Calculate mean squared error
mse = mean_squared_error(y_test, y_pred)
print('Mean Squared Error:', mse)
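
A bag-of-words matrix for millions of titles is only practical in sparse form, since a dense array needs rows × vocabulary cells. scikit-learn's linear models accept SciPy sparse input directly. A minimal sketch on toy titles; `Ridge` is swapped in here as one sparse-friendly option and is an assumption, not the model used in this notebook, and the targets are made up.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge

titles = [
    "blackout curtain for door",
    "cotton ankle length leggings",
    "metal plant pot",
    "door curtain set",
]
lengths = [2125.98, 787.4, 950.0, 2000.0]   # toy targets

vectorizer = CountVectorizer()
X_sparse = vectorizer.fit_transform(titles)   # stays a SciPy sparse matrix

model = Ridge(alpha=1.0)
model.fit(X_sparse, lengths)

# New text must go through the same fitted vectorizer.
preds = model.predict(vectorizer.transform(["curtain for door"]))
print(sparse.issparse(X_sparse), preds.shape)
```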


In [None]:
def NLP_pp(col):
  pass

In [None]:
df['TITLE'].head().apply( lambda x : x.lower() )

In [None]:
import string
df['TITLE'].head().apply( lambda x : x.translate(str.maketrans('','',string.punctuation)))

In [None]:
''.join( [ i for i in text ] )

In [40]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
!pip install spacy

In [45]:
!pip install autocorrect

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting autocorrect
  Downloading autocorrect-2.6.1.tar.gz (622 kB)
     622.8/622.8 kB 16.7 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: autocorrect
  Building wheel for autocorrect (setup.py) ... done
  Created wheel for autocorrect: filename=autocorrect-2.6.1-py3-none-any.whl size=622380 sha256=496df15a9ef0aff1eddbcc25d05d02d99598d118b9d11e8447a4300be80ac5c4
  Stored in directory: /root/.cache/pip/wheels/ab/0f/23/3c010c3fd877b962146e7765f9e9b08026cac8b035094c5750
Successfully built autocorrect
Installing collected packages: autocorrect
Successfully installed autocorrect-2.6.1


In [55]:
import itertools
from autocorrect import Speller
text="A farmmer will lovdd this food"
# Allow at most two consecutive occurrences of any character
text_correction = ''.join(''.join(s)[:2] for _, s in itertools.groupby(text))
print("Normal Text:\n{}".format(text_correction))
spell = Speller(lang='en')
ans = spell(text_correction)
print("After correcting text:\n{}".format(ans))

Normal Text:
A farmmer will lovdd this food
After correcting text:
A farmer will loved this food
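
The `groupby` expression above truncates each run of identical characters to two, which is why the sample sentence passes through unchanged: none of its runs is longer than two. A small sketch that wraps the same idea in a function and tries it on an exaggerated string; `squeeze_repeats` and `max_run` are hypothetical names.

```python
import itertools

def squeeze_repeats(text: str, max_run: int = 2) -> str:
    """Truncate every run of identical characters to at most max_run."""
    return "".join(
        "".join(group)[:max_run] for _, group in itertools.groupby(text)
    )

print(squeeze_repeats("sooooo goooood"))                   # soo good
print(squeeze_repeats("A farmmer will lovdd this food"))   # unchanged
```

The spell checker then only has to deal with at most doubled letters, which is closer to real misspellings.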


In [52]:
list(itertools.groupby(text))

[('A', <itertools._grouper at 0x7fa4e5751490>),
 (' ', <itertools._grouper at 0x7fa4e57516a0>),
 ('f', <itertools._grouper at 0x7fa4e57516d0>),
 ('a', <itertools._grouper at 0x7fa4e5751700>),
 ('r', <itertools._grouper at 0x7fa4e57515b0>),
 ('m', <itertools._grouper at 0x7fa4e57517c0>),
 ('e', <itertools._grouper at 0x7fa4e57517f0>),
 ('r', <itertools._grouper at 0x7fa4e5751820>),
 (' ', <itertools._grouper at 0x7fa4e5751850>),
 ('w', <itertools._grouper at 0x7fa4e5751880>),
 ('i', <itertools._grouper at 0x7fa4e57518b0>),
 ('l', <itertools._grouper at 0x7fa4e57518e0>),
 (' ', <itertools._grouper at 0x7fa4e5751910>),
 ('l', <itertools._grouper at 0x7fa4e5751940>),
 ('o', <itertools._grouper at 0x7fa4e5751970>),
 ('v', <itertools._grouper at 0x7fa4e57519a0>),
 ('d', <itertools._grouper at 0x7fa4e57519d0>),
 (' ', <itertools._grouper at 0x7fa4e5751a00>),
 ('t', <itertools._grouper at 0x7fa4e5751a30>),
 ('h', <itertools._grouper at 0x7fa4e5751a60>),
 ('i', <itertools._grouper at 0x7fa4e575
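
The `_grouper` objects in this output are lazy iterators, so listing the pairs directly shows little. A short sketch that materialises each group into a run length instead, using the same sample sentence:

```python
import itertools

text = "A farmmer will lovdd this food"

# Consume each group immediately to turn it into (character, run length).
runs = [(char, len(list(group))) for char, group in itertools.groupby(text)]

# Keep only the characters that actually repeat.
repeated = [(char, n) for char, n in runs if n > 1]
print(repeated)   # [('m', 2), ('l', 2), ('d', 2), ('o', 2)]
```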

In [41]:
from nltk.corpus import stopwords
from spacy.lang.en.stop_words import STOP_WORDS

In [42]:
text = "I had such high hopes for this dress 1-5 size to work for me." 
STOPWORDS = set(stopwords.words('english'))
ans = " ".join([word for word in str(text).split() if word not in STOPWORDS])
ans

'I high hopes dress 1-5 size work me.'
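
In the output above, `I` and `me.` survive because the lookup is case-sensitive and the trailing full stop is still attached to the token. A sketch that normalises each token before the lookup; the small `STOPWORDS` set here is a hand-written stand-in for the NLTK list so that the example runs without any downloads, and `remove_stopwords` is a hypothetical helper.

```python
import string

# Stand-in for set(stopwords.words('english')); not the full NLTK list.
STOPWORDS = {"i", "had", "such", "for", "this", "to", "me"}

def remove_stopwords(text: str) -> str:
    kept = []
    for word in text.split():
        # Compare a lowercased, punctuation-trimmed key, but keep the original token.
        key = word.lower().strip(string.punctuation)
        if key and key not in STOPWORDS:
            kept.append(word)
    return " ".join(kept)

print(remove_stopwords("I had such high hopes for this dress 1-5 size to work for me."))
# high hopes dress 1-5 size work
```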

In [36]:
!pip install contractions

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [73]:
import contractions
from nltk.tokenize import word_tokenize,sent_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from spacy.lang.en.stop_words import STOP_WORDS
from autocorrect import Speller
import string

def basic_process(text):
  # Lowercase, then strip digits, punctuation, and extra whitespace.
  text = text.lower()
  text = ''.join([i for i in text if not i.isdigit()])  # digit removal
  text = text.translate(str.maketrans('', '', string.punctuation))  # punctuation removal
  text = ' '.join(text.split())  # collapse repeated whitespace
  text = contractions.fix(text)  # expand contractions

  # Drop link-like tokens, e-mail-like tokens, and stopwords in one pass.
  STOPWORDS = set(stopwords.words('english'))
  text = ' '.join([i for i in str(text).split() if ('htt' not in i) and ('@' not in i) and (i not in STOPWORDS)])

  # Spell check.
  spell = Speller(lang='en')
  text = spell(text)

  # Tokenise and stem (word_tokenize needs the NLTK 'punkt' data).
  porter_stemmer = PorterStemmer()
  text = ' '.join([porter_stemmer.stem(i) for i in word_tokenize(text)])

  return text

In [74]:
df['TITLE'].head(100).apply(lambda x:basic_process(x))

0     artzfolio tulip flower blackout curtain door w...
1                 mark spencer girl panama set navi mix
2     priknik horn red electr air horn compressor in...
3     alishah women cotton ankl length log combo plu...
4             unit empir loyalist chronicl great migrat
                            ...                        
95                                             carl pop
96    gener chiffon print dupatta golden dot hang gi...
97                                 caught act loveswept
98    globalniche® leather car key case cover fiesta...
99             mountain bigfoot adult shirt brown small
Name: TITLE, Length: 100, dtype: object
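
Spell checking and stemming every title is slow, and the training set contains repeated titles (`len(df) - len(df.TITLE.unique())` gave 38,935 earlier). One way to avoid redundant work is to clean each distinct title once and map the results back. A minimal sketch; `apply_on_unique` and `slow_clean` are hypothetical names, and `slow_clean` is only a stand-in for an expensive cleaner.

```python
import pandas as pd

def apply_on_unique(series: pd.Series, func) -> pd.Series:
    """Run func once per distinct value, then broadcast results back to every row."""
    mapping = {value: func(value) for value in series.unique()}
    return series.map(mapping)

calls = []

def slow_clean(title: str) -> str:
    # Stand-in for an expensive cleaner; records each call for inspection.
    calls.append(title)
    return title.lower()

titles = pd.Series(["Wall Clock", "Some Book", "Wall Clock"])
cleaned = apply_on_unique(titles, slow_clean)
print(cleaned.tolist(), len(calls))
```

Here three rows trigger only two calls to the cleaner; the saving on the real data depends on how many titles repeat.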

# New way

In [2]:
# import important modules
import numpy as np
import pandas as pd
# sklearn modules
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB # classifier 
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    # plot_confusion_matrix,
)
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
# text preprocessing modules
from string import punctuation 
# text preprocessing modules
from nltk.tokenize import word_tokenize
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer 
import re #regular expression
# Download dependency
for dependency in (
    "brown",
    "names",
    "wordnet",
    "averaged_perceptron_tagger",
    "universal_tagset",
):
    nltk.download(dependency)
    
import warnings
warnings.filterwarnings("ignore")
# seeding
np.random.seed(123)

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package names to /root/nltk_data...
[nltk_data]   Package names is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!


In [3]:
# data = pd.read_csv("../data/labeledTrainData.tsv", sep='\t')
data = pd.read_csv("./dataset/train.csv")

In [4]:
data.head()

Unnamed: 0,PRODUCT_ID,TITLE,BULLET_POINTS,DESCRIPTION,PRODUCT_TYPE_ID,PRODUCT_LENGTH
0,1925202,ArtzFolio Tulip Flowers Blackout Curtain for D...,[LUXURIOUS & APPEALING: Beautiful custom-made ...,,1650,2125.98
1,2673191,Marks & Spencer Girls' Pyjama Sets T86_2561C_N...,"[Harry Potter Hedwig Pyjamas (6-16 Yrs),100% c...",,2755,393.7
2,2765088,PRIKNIK Horn Red Electric Air Horn Compressor ...,"[Loud Dual Tone Trumpet Horn, Compatible With ...","Specifications: Color: Red, Material: Aluminiu...",7537,748.031495
3,1594019,ALISHAH Women's Cotton Ankle Length Leggings C...,[Made By 95%cotton and 5% Lycra which gives yo...,AISHAH Women's Lycra Cotton Ankel Leggings. Br...,2996,787.401574
4,283658,The United Empire Loyalists: A Chronicle of th...,,,6112,598.424


In [5]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [6]:
stop_words = set(stopwords.words('english'))  # set lookups are much faster than scanning a list
lemmatizer = WordNetLemmatizer()

def text_cleaning(text, remove_stop_words=True, lemmatize_words=True):
    # Clean the text, with the option to remove stop words and to lemmatize words
    # Handle links and possessives first, while their punctuation is still present
    text = re.sub(r'http\S+', ' link ', text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"[^A-Za-z0-9]", " ", text)
    text = re.sub(r'\b\d+(?:\.\d+)?\s+', '', text) # remove standalone numbers

    # Remove any remaining punctuation from text
    text = ''.join([c for c in text if c not in punctuation])

    # Optionally, remove stop words (case-sensitive: the NLTK list is lowercase)
    if remove_stop_words:
        text = text.split()
        text = [w for w in text if w not in stop_words]
        text = " ".join(text)

    # Optionally, reduce words to their lemmas
    if lemmatize_words:
        text = text.split()
        lemmatized_words = [lemmatizer.lemmatize(word) for word in text]
        text = " ".join(lemmatized_words)

    # Return the cleaned text as a single string
    return text

In [8]:
data['TITLE']=data['TITLE'].astype(str)

In [9]:
data.dtypes

PRODUCT_ID           int64
TITLE               object
BULLET_POINTS       object
DESCRIPTION         object
PRODUCT_TYPE_ID      int64
PRODUCT_LENGTH     float64
dtype: object

In [11]:
data["cleaned_title"] = data["TITLE"].apply(text_cleaning)

In [22]:
data.tail()

Unnamed: 0,PRODUCT_ID,TITLE,BULLET_POINTS,DESCRIPTION,PRODUCT_TYPE_ID,PRODUCT_LENGTH,cleaned_title
2249693,2422167,Nike Women's As W Ny Df Swsh Hn Kh Bra (CZ7610...,Material : Polyester,,3009,1181.1,Nike Women As W Ny Df Swsh Hn Kh Bra CZ7610 Bl...
2249694,2766635,"(3PCS) Goose Game Cute Cartoon Enamel Pins, Fu...",[❤ [Inspiration] Inspired by the Untitled Goos...,<p><b>[Brand]: </b>XVIEONR</p> <p><br></p> <p>...,3413,125.984252,3PCS Goose Game Cute Cartoon Enamel Pins Funny...
2249695,1987786,Kangroo Sweep Movement Printed Wooden Wall Clo...,"[Dial size: 12 inches in diameter,Big, clear r...",Wall Clocks Are Very Attractive In Looks And E...,1574,1200.0,Kangroo Sweep Movement Printed Wooden Wall Clo...
2249696,1165754,Electro Voice EKX-BRKT15 | Wall Mount Bracket ...,,,592,2900.0,Electro Voice EKX BRKT15 Wall Mount Bracket EK...
2249697,1072666,Skyjacker C7360SP Component Box For PN[C7360PK...,"[Component Box For PN[C7360PK],4 in. Lift,Incl...",Skyjacker C7360SP Component Box For PN[C7360PK...,7367,2000.0,Skyjacker C7360SP Component Box For PN C7360PK...


In [12]:
X = data[["cleaned_title",'PRODUCT_TYPE_ID']]
y = data[['PRODUCT_LENGTH']]

In [13]:
X_train, X_valid, y_train, y_valid = train_test_split(
    X,
    y,
    test_size=0.15,
    random_state=42,
    shuffle=True,
    # stratify=y,
)

In [14]:
X_train.head()

Unnamed: 0,cleaned_title,PRODUCT_TYPE_ID
1174126,ZOOEASS Oval Holographic Reflective Shoelace F...,3101
1493197,WULFUL Men Slim Fit Skinny Stretch Comfy Denim...,2835
1754766,MNG Hardware Poise Knob Back Plate Large Polis...,10189
768621,Classic Holiday Standards,804
1154493,Interlanguage Forty year later Language Learni...,94


In [30]:
X_valid.head()

Unnamed: 0,cleaned_title,PRODUCT_TYPE_ID
90943,Calendrier sacre maya 2006,1
1001689,Fractured Reality,114
536614,The Golden Butterfly daring illusion age perfo...,8501
369746,10x30 Contemporary Bronze Complete Wood Panora...,12228
175997,Quality By Experimental Design 3rd Edition Qua...,6320


In [40]:
y_train.head()

Unnamed: 0,PRODUCT_LENGTH
1174126,5512.0
1493197,100.0
1754766,200.0
768621,490.0
1154493,650.0


In [41]:
y_valid.head()

Unnamed: 0,PRODUCT_LENGTH
90943,492.125984
1001689,500.0
536614,2.0
369746,3150.0
175997,700.0


In [15]:
sentiment_classifier = Pipeline(steps=[
    ('pre_processing', TfidfVectorizer(lowercase=False)),
    ('naive_bayes', MultinomialNB()),
])
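
A `Pipeline` whose first step is a text vectorizer expects a one-dimensional sequence of strings, so with a two-column frame like `X_train` only the `cleaned_title` column should be passed in. A minimal sketch on toy data with the same two steps; the titles and the class labels (standing in for encoded lengths) are made up.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

toy = pd.DataFrame({
    "cleaned_title": ["door curtain", "window curtain", "plant pot", "metal pot"],
    "PRODUCT_TYPE_ID": [1650, 1650, 5725, 5725],
})
labels = [0, 0, 1, 1]   # toy class labels

clf = Pipeline(steps=[
    ("pre_processing", TfidfVectorizer(lowercase=False)),
    ("naive_bayes", MultinomialNB()),
])

# Pass the text column only; the vectorizer cannot consume a whole DataFrame.
clf.fit(toy["cleaned_title"], labels)
pred = clf.predict(pd.Series(["blackout curtain"]))
print(pred)
```

Using `PRODUCT_TYPE_ID` alongside the text would need something like a `ColumnTransformer` in front of the classifier.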

In [42]:
X_train.shape

(1912243, 2)

In [43]:
y_train.shape

(1912243, 1)

In [16]:
cc  =CountVectorizer()

In [17]:
cc.fit(X_train['cleaned_title'])

In [18]:
transformed = cc.transform(X_train['cleaned_title'])

In [19]:
transformed = pd.DataFrame.sparse.from_spmatrix(transformed)

In [24]:
transformed.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,685193,685194,685195,685196,685197,685198,685199,685200,685201,685202
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
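
The vocabulary here has 685,203 columns, and many of them are likely to be tokens that occur only once or twice. `CountVectorizer` has `min_df` and `max_features` parameters for pruning such a vocabulary. A toy sketch; the titles and the threshold of 2 are illustrative only, not tuned values.

```python
from sklearn.feature_extraction.text import CountVectorizer

titles = [
    "door curtain blue",
    "window curtain",
    "curtain rod zx9917",
    "door mat",
]

full = CountVectorizer().fit(titles)
# Keep only tokens that appear in at least two titles.
pruned = CountVectorizer(min_df=2).fit(titles)

print(len(full.vocabulary_), sorted(pruned.vocabulary_))
```

A smaller vocabulary also shrinks every downstream model that stores one weight or count per feature.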


In [None]:
pd.DataFrame(transformed)

In [20]:
tf = TfidfVectorizer()

In [21]:
tf.fit(X_train['cleaned_title'])

In [22]:
d = tf.transform(X_train['cleaned_title'])

In [None]:
pd.DataFrame(d)

In [23]:
d = pd.DataFrame.sparse.from_spmatrix(d)

In [25]:
d.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,685193,685194,685195,685196,685197,685198,685199,685200,685201,685202
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [26]:
data['PRODUCT_LENGTH'].head()

0    2125.980000
1     393.700000
2     748.031495
3     787.401574
4     598.424000
Name: PRODUCT_LENGTH, dtype: float64

In [27]:
from sklearn.preprocessing import LabelEncoder

In [28]:
le = LabelEncoder()

In [40]:
le.fit(y_train['PRODUCT_LENGTH'])  # pass a 1-D Series; y_train is a one-column DataFrame

In [41]:
encoded_target  = le.transform(y_train['PRODUCT_LENGTH'])

In [37]:
len(data['PRODUCT_LENGTH'])

2249698

In [36]:
len(encoded_target)

2249698

In [38]:
transformed.shape

(1912243, 685203)

In [39]:
d.shape

(1912243, 685203)

In [42]:
# df = pd.concat(transformed,encoded_target)
# df.head()
transformed['target'] = encoded_target

In [43]:
transformed.head(2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,685194,685195,685196,685197,685198,685199,685200,685201,685202,target
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,11875
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1052
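
`LabelEncoder` maps each distinct length to an integer, which turns the numeric target into a class label with one class per unique value. It can only transform values it saw during `fit`, so lengths that appear only in the validation split cannot be encoded. A toy sketch of the round trip and of the unseen-value case; `toy_le` and the numbers are illustrative.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

lengths = np.array([393.7, 598.0, 393.7, 2125.98])

toy_le = LabelEncoder()
encoded = toy_le.fit_transform(lengths)        # classes are the sorted unique values
decoded = toy_le.inverse_transform(encoded)    # back to the original lengths
print(encoded, decoded)

try:
    toy_le.transform(np.array([700.0]))        # value not seen during fit
    unseen_ok = True
except ValueError:
    unseen_ok = False
print(unseen_ok)
```

Predictions from a classifier trained on these labels need `inverse_transform` before they can be read as lengths.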


In [1]:
# sentiment_classifier.fit(X_train,y_train)
model = MultinomialNB()
model.fit(transformed.drop('target',axis='columns'),encoded_target)

NameError: ignored

In [None]:
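# Vectorize the validation titles with the fitted CountVectorizer before predicting.
pred = model.predict(cc.transform(X_valid['cleaned_title']))
pred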

In [None]:
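# The pipeline must be fitted first, e.g. sentiment_classifier.fit(X_train['cleaned_title'], encoded_target).
y_preds = sentiment_classifier.predict(X_valid['cleaned_title'])

In [None]:
# Decode class labels back to lengths, then measure the share of exact matches.
np.mean(y_valid['PRODUCT_LENGTH'].to_numpy() == le.inverse_transform(y_preds))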
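
Once predictions exist for `test.csv`, they have to be written in the two-column layout of `sample_submission.csv` (`PRODUCT_ID`, `PRODUCT_LENGTH`). A sketch with placeholder values; `make_submission` is a hypothetical helper, the constant prediction is made up, and the output goes to an in-memory buffer rather than a real file.

```python
import io
import pandas as pd

def make_submission(product_ids, predictions) -> pd.DataFrame:
    """Build a frame with the same columns as sample_submission.csv."""
    return pd.DataFrame({
        "PRODUCT_ID": list(product_ids),
        "PRODUCT_LENGTH": list(predictions),
    })

# Two IDs taken from the test preview above, with a placeholder prediction.
submission = make_submission([604373, 1729783], [700.0, 700.0])

buffer = io.StringIO()
submission.to_csv(buffer, index=False)   # index=False keeps the file to two columns
print(buffer.getvalue())
```

In the notebook the same call would take a file path, for example `submission.to_csv("submission.csv", index=False)`.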