# Shopee Product Matching: Data Analysis

* Information about the Shopee dataset

### During data exploration we will answer the following questions:

1. How many different types of products are in the dataset?
2. Which category has the highest number of products?
3. How many products are in each category?
4. How many images are related to one product?
5. What are the titles of the products in the same label_group, and what are the most frequent words in those titles?

## Download the Data

## Data Preparation and Cleaning 

In [None]:
import pandas as pd 

In [None]:
df_train = pd.read_csv('../input/shopee-product-matching/train.csv')

In [None]:
df_train

In [None]:
df_train.info()

Missing values per column

In [None]:
df_train.isna().sum()
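
The raw counts above are easier to judge next to percentages. A minimal sketch, assuming a hypothetical helper `missing_summary` and a made-up toy frame (neither is part of the original notebook); on the real data it would be called as `missing_summary(df_train)`:

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages (hypothetical helper)."""
    counts = df.isna().sum()
    percent = (counts / len(df) * 100).round(2)
    return pd.DataFrame({"missing": counts, "percent": percent})


# Toy frame with made-up rows, only to illustrate the call.
toy = pd.DataFrame({"title": ["a", None, "c", "d"], "label_group": [1, 1, 2, 2]})
summary = missing_summary(toy)
print(summary)
```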

## Exploratory Analysis and Visualization

In [None]:
df_train.columns

### Label_group

- How many different types of products are in the dataset?

In [None]:
len(df_train.label_group.unique())

In [None]:
products_by_label_group = df_train.label_group.value_counts()
products_by_label_group 

In [None]:
products_by_label_group[:20].index.astype(str)

In [None]:
import matplotlib.pyplot as plt
plt.barh(products_by_label_group[:20].index.astype(str), products_by_label_group[:20].values)


In [None]:
import seaborn as sns
sns.set_style('darkgrid')
# plot the group sizes rather than the raw label_group IDs
sns.histplot(products_by_label_group, log_scale=True)

In [None]:
# distplot is deprecated in seaborn; histplot with kde=True is its replacement
sns.histplot(products_by_label_group, kde=True)

Most label groups have fewer than 10 products.
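
One way to check this observation numerically is to bin the group sizes into ranges and count the groups in each range. The sketch below is only an illustration: the helper name `size_buckets`, the bin edges, and the toy sizes are assumptions, not part of the original notebook. On the real data, `df_train.label_group.value_counts()` would be passed in.

```python
import pandas as pd

# Bin edges and labels are illustrative choices, not taken from the dataset.
BINS = [0, 2, 4, 9, 19, float("inf")]
LABELS = ["1-2", "3-4", "5-9", "10-19", "20+"]


def size_buckets(group_sizes: pd.Series) -> pd.DataFrame:
    """Count how many label groups fall into each size range."""
    buckets = pd.cut(group_sizes, bins=BINS, labels=LABELS)
    # Convert to plain strings so the result can be reindexed in label order.
    counts = buckets.astype(str).value_counts().reindex(LABELS, fill_value=0)
    return pd.DataFrame({"groups": counts, "share": counts / counts.sum()})


# Made-up group sizes standing in for the real value_counts() result.
toy_sizes = pd.Series([2, 2, 3, 5, 12, 51])
bucket_table = size_buckets(toy_sizes)
print(bucket_table)
```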

- Which category has the highest number of products?
- How many products are in each category?

In [None]:
high_product_group = products_by_label_group[products_by_label_group >= 10]
len(high_product_group) / len(products_by_label_group)

In [None]:
low_product_group = products_by_label_group[products_by_label_group < 10]
len(low_product_group) / len(products_by_label_group)

In [None]:
sns.histplot(high_product_group)

In [None]:
sns.histplot(low_product_group)

In [None]:
products_by_label_group[products_by_label_group == 2]
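
The lookup above targets one size. Applying `value_counts` a second time to the group sizes gives the number of label groups for every size at once. A small sketch on made-up IDs (the toy data and the variable names are illustrative assumptions):

```python
import pandas as pd

# Made-up label_group IDs: two groups of size 2, one of size 3, one of size 4.
labels = pd.Series([10, 10, 20, 20, 30, 30, 30, 40, 40, 40, 40], name="label_group")

group_sizes = labels.value_counts()                        # products per label group
groups_per_size = group_sizes.value_counts().sort_index()  # label groups per size
print(groups_per_size)
```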

## Title

In [None]:
df_train.title

- What are the titles of the products in the same label_group?

In [None]:
df_train['title'][df_train['label_group'] == 3627744656]
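
The cell above hard-codes a label_group ID. If the aim is to inspect the largest group, the ID can also be looked up programmatically. A sketch, assuming a hypothetical helper `titles_in_largest_group` and made-up toy rows:

```python
import pandas as pd


def titles_in_largest_group(df: pd.DataFrame):
    """Return the ID of the most frequent label_group and its titles."""
    largest_id = df["label_group"].value_counts().idxmax()
    titles = df.loc[df["label_group"] == largest_id, "title"].tolist()
    return largest_id, titles


# Toy rows invented for illustration only.
toy = pd.DataFrame({
    "label_group": [7, 7, 7, 9, 9],
    "title": ["red shoes", "Red Shoes men", "shoes red 42", "blue bag", "bag blue"],
})
group_id, group_titles = titles_in_largest_group(toy)
print(group_id, group_titles)
```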

#### Number of Words

In [None]:
# split() without an argument ignores repeated spaces, so empty strings are not counted as words
df_train['word_count'] = df_train['title'].apply(lambda x: len(str(x).split()))
df_train[['title','word_count']].head()

#### Number of characters

In [None]:
df_train['char_count'] = df_train['title'].str.len() ## this also includes spaces
df_train[['title','char_count']].head()

#### Average Word Length

In [None]:
def avg_word(sentence):
    """Return the mean word length of a title, or 0 for an empty title."""
    words = str(sentence).split()
    if not words:
        return 0
    return sum(len(word) for word in words) / len(words)

df_train['avg_word'] = df_train['title'].apply(avg_word)
df_train[['title','avg_word']].head()

#### Number of stopwords

In [None]:
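import nltk
from nltk.corpus import stopwords

nltk.download('stopwords', quiet=True)  # fetch the stopword corpus if it is missing
stop = set(stopwords.words('english'))

# NLTK stopwords are lowercase, so lowercase each word before the lookup
df_train['stopwords'] = df_train['title'].apply(lambda x: len([w for w in x.split() if w.lower() in stop]))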
df_train[['title','stopwords']].head()

#### Number of special characters (hashtags)

In [None]:
# count the words that start with '#'
df_train['hashtags'] = df_train['title'].apply(lambda x: len([w for w in x.split() if w.startswith('#')]))
df_train[['title','hashtags']].head()

#### Number of numerics

In [None]:
df_train['numerics'] = df_train['title'].apply(lambda x: len([w for w in x.split() if w.isdigit()]))
df_train[['title','numerics']].head()

#### Number of Uppercase words

In [None]:
df_train['upper'] = df_train['title'].apply(lambda x: len([w for w in x.split() if w.isupper()]))
df_train[['title','upper']].head()
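
The per-title counts in this section can also be collected in one pass. The sketch below is one possible arrangement, not the notebook's own method: the helper name `title_features` and the toy titles are assumptions, and missing titles are treated as empty strings.

```python
import pandas as pd


def title_features(titles: pd.Series) -> pd.DataFrame:
    """Compute word, character, numeric, and uppercase counts per title."""
    text = titles.fillna("").astype(str)  # treat missing titles as empty
    tokens = text.str.split()             # whitespace-separated words
    return pd.DataFrame({
        "word_count": tokens.str.len(),
        "char_count": text.str.len(),     # includes spaces
        "numerics": tokens.apply(lambda ws: sum(w.isdigit() for w in ws)),
        "upper": tokens.apply(lambda ws: sum(w.isupper() for w in ws)),
    })


# Made-up titles for illustration only.
toy_titles = pd.Series(["Blue Cotton Shirt", "USB cable 2 m FAST", None])
features = title_features(toy_titles)
print(features)
```

On the real data the call would be `title_features(df_train['title'])`, and the result could be joined back to `df_train` by index.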

#### Common words

In [None]:
freq = pd.Series(' '.join(df_train['title']).split()).value_counts()[:10]
freq

#### Common words in the same label_group

- What are the most frequent words in the product titles of the same label group?

In [None]:
freq = pd.Series(' '.join(df_train['title'][df_train['label_group'] == 3627744656]).split()).value_counts()[:10]
freq
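
The count above is case-sensitive, so "Red" and "red" are tallied separately. A hedged alternative is to lowercase the words before counting. The helper name `top_words` and the toy titles below are assumptions made for illustration:

```python
from collections import Counter

import pandas as pd


def top_words(titles: pd.Series, n: int = 10):
    """Return the n most frequent lowercase words across the given titles."""
    counter = Counter()
    for title in titles.dropna():
        counter.update(str(title).lower().split())
    return counter.most_common(n)


# Made-up titles standing in for the titles of one label group.
toy_titles = pd.Series(["Red Shoes", "red shoes men", "RED bag"])
common = top_words(toy_titles, n=2)
print(common)
```

On the real data the input would be the filtered titles, for example `df_train['title'][df_train['label_group'] == 3627744656]`.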

#### Rare words

In [None]:
freq = pd.Series(' '.join(df_train['title']).split()).value_counts()[-10:]
freq

#### Rare words in same label_group

In [None]:
freq = pd.Series(' '.join(df_train['title'][df_train['label_group'] == 3627744656]).split()).value_counts()[-10:]
freq

## Image

In [None]:
len(df_train.image.unique())

In [None]:
import os
files_train = os.listdir('../input/shopee-product-matching/train_images/')
print("training images files: ", files_train[:5])
print("Number of images: ", len(files_train))

In [None]:
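import cv2


def show_image(files, nr, nc):
    """Display up to nr * nc images from the training image folder in a grid."""
    figure, ax = plt.subplots(nrows=nr, ncols=nc, figsize=(10, 10), squeeze=False)
    for axis in ax.ravel():
        axis.set_axis_off()  # hide ticks and frames, including on unused cells
    # zip stops at the shorter sequence, so fewer than nr * nc files is fine
    for axis, file_name in zip(ax.ravel(), files):
        image = cv2.imread('../input/shopee-product-matching/train_images/' + file_name)
        # OpenCV loads images as BGR, while matplotlib expects RGB
        axis.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    plt.show()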

In [None]:
show_image(files_train, 6, 6)

### Images of a label_group with many products

- How many images are related to one product?

In [None]:
top_products = pd.Series(df_train['image'][df_train['label_group'] == 3627744656]).array
len(top_products)
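
The count above covers a single group. To answer the question for every group at once, postings and distinct images can be aggregated per label_group. A sketch on made-up rows (the toy frame and the variable name `images_per_group` are illustrative assumptions):

```python
import pandas as pd

# Made-up rows: group 1 reuses one image across two postings.
toy = pd.DataFrame({
    "label_group": [1, 1, 1, 2, 2],
    "image": ["a.jpg", "a.jpg", "b.jpg", "c.jpg", "d.jpg"],
})

# "count" is the number of postings, "nunique" the number of distinct image files.
images_per_group = toy.groupby("label_group")["image"].agg(["count", "nunique"])
print(images_per_group)
```

With the real data, replacing `toy` by `df_train` gives one row per label_group.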

In [None]:
show_image(top_products, 6, 6)

### Images of a label_group with few products

In [None]:
low_products = pd.Series(df_train['image'][df_train['label_group'] == 834066355]).array

In [None]:
show_image(low_products, 3, 3)

## Summary and Conclusion

In conclusion, we explored the dataset to understand it better, which will make it easier to start the next step of the project. In this dataset, less than 5% of label groups have more than 10 products, and over 6979 label groups contain 2 product images. With so few examples in most groups, building a good model will be more challenging.