# **Data preparation for cloud computing project**


## **Purpose of this Notebook:**

This notebook aims to prepare the dataset containing Amazon reviews for a cloud computing project focused on sentiment analysis using Azure services.

## **Project Overview:**
The goal of this project is to analyze Amazon reviews stored in a CSV file and leverage Azure services to extract sentiment insights. The analyzed data will be stored and visualized using Azure and Power BI.

In [1]:
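import pandas as pd

# Reading the CSV file imported from Kaggle.
# Note: test.csv has no header row, so pandas uses its first record as the column names
# (this is visible in the df.columns output further down).
df = pd.read_csv('test.csv')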

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 399999 entries, 0 to 399998
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------

In [3]:
df.shape

(399999, 3)

In [4]:
df.head()

Unnamed: 0,2,Great CD,"My lovely Pat has one of the GREAT voices of her generation. I have listened to this CD for YEARS and I still LOVE IT. When I'm in a good mood it makes me feel better. A bad mood just evaporates like sugar in the rain. This CD just oozes LIFE. Vocals are jusat STUUNNING and lyrics just kill. One of life's hidden gems. This is a desert isle CD in my book. Why she never made it big is just beyond me. Everytime I play this, no matter black, white, young, old, male, female EVERYBODY says one thing ""Who was that singing ?"""
0,2,One of the best game music soundtracks - for a...,Despite the fact that I have only played a sma...
1,1,Batteries died within a year ...,I bought this charger in Jul 2003 and it worke...
2,2,"works fine, but Maha Energy is better",Check out Maha Energy's website. Their Powerex...
3,2,Great for the non-audiophile,Reviewed quite a bit of the combo players and ...
4,1,DVD Player crapped out after one year,I also began having the incorrect disc problem...


In [5]:
print(df.columns)
print(len(df.columns))

Index(['2', 'Great CD',
       'My lovely Pat has one of the GREAT voices of her generation. I have listened to this CD for YEARS and I still LOVE IT. When I'm in a good mood it makes me feel better. A bad mood just evaporates like sugar in the rain. This CD just oozes LIFE. Vocals are jusat STUUNNING and lyrics just kill. One of life's hidden gems. This is a desert isle CD in my book. Why she never made it big is just beyond me. Everytime I play this, no matter black, white, young, old, male, female EVERYBODY says one thing "Who was that singing ?"'],
      dtype='object')
3
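
The column names printed above are the first review of the file, which suggests that `test.csv` has no header row and that one record is being consumed as the header. A minimal sketch of one way to avoid this, assuming the column order is label, title, text (the mapping used later in this notebook); the two-row sample and the name `df_fixed` are hypothetical stand-ins so the example runs without the real file:

```python
import io
import pandas as pd

# Two-row stand-in for test.csv (hypothetical sample; the real file is read from disk).
sample_csv = io.StringIO(
    '2,"Great CD","My lovely Pat has one of the GREAT voices of her generation."\n'
    '1,"Batteries died within a year ...","I bought this charger in Jul 2003."\n'
)

# header=None stops pandas from using the first record as column names;
# names= assigns the column names directly, so no later rename is needed.
df_fixed = pd.read_csv(sample_csv, header=None, names=["label", "title", "text"])

print(df_fixed.shape)
print(list(df_fixed.columns))
```

Read this way, the real file would keep its first review as data, so the row count would be one higher than the 399999 reported by `df.shape`.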


In [6]:
# This dataset is the source for my data analysis project.
# It is large (almost 400k rows), so I will use only the first 1000 rows and keep the same columns.
# The subset will be saved as two CSV files:
# one with labels named data_with_labels.csv and one without labels named data_without_labels.csv
# The labeled dataset will have 3 columns:
# - title: the second column of the original dataset
# - text: the third column of the original dataset
# - label: the first column of the original dataset

print(df.columns)

# Renaming the columns based on the actual column names in the dataset
df = df.rename(columns={
    df.columns[1]: 'title',
    df.columns[2]: 'text',
    df.columns[0]: 'label'
})

print(df.columns)

# Selecting the first 1000 rows
data_with_labels = df.iloc[:1000]

# Rearranging the columns
data_with_labels = data_with_labels[['title', 'text', 'label']]

# Saving the new dataset to a CSV file
data_with_labels.to_csv('data_with_labels.csv', index=False)

print(data_with_labels.head())

print(data_with_labels.shape)

print(data_with_labels.columns)

Index(['2', 'Great CD',
       'My lovely Pat has one of the GREAT voices of her generation. I have listened to this CD for YEARS and I still LOVE IT. When I'm in a good mood it makes me feel better. A bad mood just evaporates like sugar in the rain. This CD just oozes LIFE. Vocals are jusat STUUNNING and lyrics just kill. One of life's hidden gems. This is a desert isle CD in my book. Why she never made it big is just beyond me. Everytime I play this, no matter black, white, young, old, male, female EVERYBODY says one thing "Who was that singing ?"'],
      dtype='object')
Index(['label', 'title', 'text'], dtype='object')
                                               title  \
0  One of the best game music soundtracks - for a...   
1                   Batteries died within a year ...   
2              works fine, but Maha Energy is better   
3                       Great for the non-audiophile   
4              DVD Player crapped out after one year   

                                

In [7]:
# Counting the missing values in each column
print(data_with_labels.isnull().sum())

# Dropping the rows with missing values
data_with_labels = data_with_labels.dropna()

# Saving again so that data_with_labels.csv does not keep the row with the missing title
data_with_labels.to_csv('data_with_labels.csv', index=False)

# Confirming that no missing values remain
print(data_with_labels.isnull().sum())


title    1
text     0
label    0
dtype: int64
title    0
text     0
label    0
dtype: int64
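
Because one of the 1000 selected rows has a missing title, dropping it leaves 999 rows. If exactly 1000 complete rows are wanted, one option is to drop incomplete rows before taking the subset. This is only a sketch: the helper `first_n_complete_rows` and the small `demo` frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

def first_n_complete_rows(frame: pd.DataFrame, n: int) -> pd.DataFrame:
    """Return the first n rows that contain no missing values."""
    # dropna() removes rows with any missing value, head(n) keeps the first n survivors,
    # and reset_index(drop=True) renumbers them from 0.
    return frame.dropna().head(n).reset_index(drop=True)

# Small stand-in for the review data, with one missing title.
demo = pd.DataFrame({
    "title": ["a", None, "c", "d"],
    "text": ["t1", "t2", "t3", "t4"],
    "label": [2, 1, 1, 2],
})

subset = first_n_complete_rows(demo, 3)
print(subset)
```

On the real data this would be called as `first_n_complete_rows(df, 1000)` after the columns have been renamed.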


In [8]:
# Counting fully duplicated rows (0 means there are none)
print(data_with_labels.duplicated().sum())

0
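
`duplicated()` with no arguments only flags rows that match in every column. For review data it can also be worth checking whether the same review text appears under different titles. A minimal sketch on hypothetical sample data, using the `subset` parameter of `duplicated` and `drop_duplicates`:

```python
import pandas as pd

# Hypothetical sample: rows 0 and 1 share the same text but differ in title.
demo = pd.DataFrame({
    "title": ["Great CD", "Great CD!", "Batteries died"],
    "text": ["Lovely voice.", "Lovely voice.", "Stopped charging."],
    "label": [2, 2, 1],
})

# Full-row check: the titles differ, so no row is flagged.
full_row_dupes = int(demo.duplicated().sum())

# Text-only check: row 1 repeats the text of row 0 and is flagged.
text_dupes = int(demo.duplicated(subset=["text"]).sum())

# keep="first" (the default) retains the first occurrence of each text.
deduplicated = demo.drop_duplicates(subset=["text"]).reset_index(drop=True)

print(full_row_dupes, text_dupes, len(deduplicated))
```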


In [9]:
#Creating the dataset without labels
data_without_labels = data_with_labels.drop('label', axis=1)

# Saving the new dataset to a CSV file
data_without_labels.to_csv('data_without_labels.csv', index=False)

print(data_without_labels.head())

                                               title  \
0  One of the best game music soundtracks - for a...   
1                   Batteries died within a year ...   
2              works fine, but Maha Energy is better   
3                       Great for the non-audiophile   
4              DVD Player crapped out after one year   

                                                text  
0  Despite the fact that I have only played a sma...  
1  I bought this charger in Jul 2003 and it worke...  
2  Check out Maha Energy's website. Their Powerex...  
3  Reviewed quite a bit of the combo players and ...  
4  I also began having the incorrect disc problem...  
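
Before uploading the files, a quick round-trip check can confirm that a saved CSV reloads with the expected columns and no missing values. The sketch below writes a hypothetical one-row sample to a temporary directory so it does not touch the real files; the same checks could be run on `data_with_labels.csv` and `data_without_labels.csv`.

```python
import os
import tempfile
import pandas as pd

# Hypothetical one-row sample standing in for the prepared data.
demo = pd.DataFrame({"title": ["Great CD"], "text": ["Lovely voice."], "label": [2]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data_without_labels.csv")
    # Same save pattern as the notebook: drop the label column, write without the index.
    demo.drop(columns="label").to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The reloaded file should contain only the two text columns and no missing values.
print(list(reloaded.columns))
print(int(reloaded.isnull().sum().sum()))
```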


**NOW MY DATA IS READY**