# Introduction

We have a dataset of reviews for applications used for schedule planning and to-do lists. As Data Science Interns, we will perform the necessary data pre-processing and analysis to gain insights into the data. The final goal is to derive sentiment analysis results for the reviews of these applications.

We will start by performing EDA on the data. For that, let's load the data and create our dataframe:

In [1]:
# libraries & packages
import pandas as pd
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Ankur\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Ankur\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
# read csv file
reviewData = pd.read_csv("US_project.csv")

# showing data 
reviewData.sample(5)

Unnamed: 0,reviewId,userName,userImage,content,thumbsUpCount,reviewCreatedVersion,at,replyContent,repliedAt,appVersion,sortOrder,appId
9135,fb2901aa-57fe-4b3f-8c30-a199732fc615,Kay-Uwe Wagner,https://play-lh.googleusercontent.com/a-/ALV-U...,"This app should get a ""-1"". Almost all task li...",13,2.114.690.02,25-02-2024 16:01,,,2.114.690.02,newest,com.microsoft.todos
15263,4799ec89-bdfc-4faf-a6a2-b6009591f03c,A Google user,https://play-lh.googleusercontent.com/EGemoI2N...,"this app had been good to me for years, but th...",0,2.49.0,21-01-2024 20:06,"Hi, which view do you refer to? Can you maybe ...",23-01-2024 09:11,2.49.0,most_relevant,com.appgenix.bizcal
8385,99d2117a-01bc-41d6-911f-13627d59d609,josiah wako,https://play-lh.googleusercontent.com/a/ACg8oc...,I like it because it is simple to and unlike s...,0,1.8.0,10-11-2021 20:05,,,1.8.0,most_relevant,com.habitnow
13463,40bf7135-3c38-49c1-84f7-a09a6ac3d747,A Google user,https://play-lh.googleusercontent.com/EGemoI2N...,Tasks that repeat more than once daily dont wo...,19,1.31,20-08-2019 03:10,"Hi Carlton, Tasks will always honour the sched...",20-08-2019 07:33,1.31,most_relevant,com.tasks.android
6379,a5e915fd-ce9d-4fdb-8a77-931e3b4aa92e,Glenn Weaver,https://play-lh.googleusercontent.com/a-/ALV-U...,"Like others have said in here, the widget free...",2,1.5.15,19-08-2020 17:11,,,1.5.15,newest,com.oristats.habitbull


In [3]:
# Also, let's check the rows & columns of the dataframe
reviewData.shape

(16787, 12)

In [4]:
# Also some basic information about the dataset, like memory usage and data types of columns
reviewData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16787 entries, 0 to 16786
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   reviewId              16787 non-null  object
 1   userName              16787 non-null  object
 2   userImage             16787 non-null  object
 3   content               16786 non-null  object
 4   thumbsUpCount         16787 non-null  int64 
 5   reviewCreatedVersion  14430 non-null  object
 6   at                    16787 non-null  object
 7   replyContent          9168 non-null   object
 8   repliedAt             9168 non-null   object
 9   appVersion            14430 non-null  object
 10  sortOrder             16787 non-null  object
 11  appId                 16787 non-null  object
dtypes: int64(1), object(11)
memory usage: 1.5+ MB


# Data Pre-processing

## Removing Irrelevant Data

As the sample above shows, the **userImage** column contains only URLs and is not relevant to our analysis, so it is better to drop it; this also reduces the size of the dataframe and helps speed up processing. Further, the columns **reviewCreatedVersion** & **appVersion** hold the same values, which is redundant, so one of them will be removed as well.

**_Note_**: Keep an eye on the memory usage of the dataframe. It should decrease as we clean the data further.
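
The redundancy between two columns can be checked before one of them is dropped. The following is a minimal sketch on a small made-up frame (the values are illustrative and not taken from the dataset); on the real data the same call would be made on `reviewData`.

```python
import pandas as pd

# Toy stand-in for reviewData; the values are made up for illustration.
toy = pd.DataFrame({
    "reviewCreatedVersion": ["2.49.0", None, "1.8.0"],
    "appVersion": ["2.49.0", None, "1.8.0"],
})

# Series.equals treats missing values in the same positions as equal,
# which an element-wise == comparison would not do.
columns_identical = toy["reviewCreatedVersion"].equals(toy["appVersion"])
print(columns_identical)
```

If the check returns `False` on the real data, the rows that differ are worth inspecting before either column is removed.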

In [5]:
# removing irrelevant data (columns= already selects the axis, so axis=1 is not needed)
reviewData.drop(columns=['userImage', 'reviewCreatedVersion'], inplace=True)

reviewData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16787 entries, 0 to 16786
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   reviewId       16787 non-null  object
 1   userName       16787 non-null  object
 2   content        16786 non-null  object
 3   thumbsUpCount  16787 non-null  int64 
 4   at             16787 non-null  object
 5   replyContent   9168 non-null   object
 6   repliedAt      9168 non-null   object
 7   appVersion     14430 non-null  object
 8   sortOrder      16787 non-null  object
 9   appId          16787 non-null  object
dtypes: int64(1), object(9)
memory usage: 1.3+ MB


## Fixing Datatypes

In [6]:
reviewData.describe()

Unnamed: 0,thumbsUpCount
count,16787.0
mean,9.658962
std,32.028656
min,0.0
25%,0.0
50%,1.0
75%,6.0
max,1951.0


**_info()_** shows that most of the columns have the **object** datatype, which is the default datatype for non-numeric values, and one of them has the **int64** datatype. Using **_describe()_**, we see that the maximum value of the integer column is _1951_, so **int16** (which holds values up to 32,767) should be enough.
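
The range check behind this choice can be made explicit. A minimal sketch, assuming a few sample values in place of the real **thumbsUpCount** column (only the maximum, 1951, comes from the `describe()` output above):

```python
import numpy as np
import pandas as pd

# Illustrative values; 1951 is the maximum reported by describe().
counts = pd.Series([0, 1, 6, 1951], dtype="int64")

# np.iinfo reports the representable range of an integer dtype.
limits = np.iinfo(np.int16)
fits = limits.min <= counts.min() and counts.max() <= limits.max
print(limits.min, limits.max, fits)

# pd.to_numeric can also choose the smallest integer dtype that fits.
downcast = pd.to_numeric(counts, downcast="integer")
print(downcast.dtype)
```

Checking the range first matters because `astype` to a dtype that is too small can silently wrap values instead of raising an error.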

In [7]:
#converting datatype for the int column
reviewData['thumbsUpCount'] = reviewData['thumbsUpCount'].astype('int16')
reviewData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16787 entries, 0 to 16786
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   reviewId       16787 non-null  object
 1   userName       16787 non-null  object
 2   content        16786 non-null  object
 3   thumbsUpCount  16787 non-null  int16 
 4   at             16787 non-null  object
 5   replyContent   9168 non-null   object
 6   repliedAt      9168 non-null   object
 7   appVersion     14430 non-null  object
 8   sortOrder      16787 non-null  object
 9   appId          16787 non-null  object
dtypes: int16(1), object(9)
memory usage: 1.2+ MB


Further, of the remaining columns, **_at_** & **_repliedAt_** are better represented with the **datetime** datatype.

In [8]:
datetime_colms = ['at','repliedAt']
for c in datetime_colms:
    reviewData[c] = pd.to_datetime(reviewData[c], dayfirst=True)
reviewData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16787 entries, 0 to 16786
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   reviewId       16787 non-null  object        
 1   userName       16787 non-null  object        
 2   content        16786 non-null  object        
 3   thumbsUpCount  16787 non-null  int16         
 4   at             16787 non-null  datetime64[ns]
 5   replyContent   9168 non-null   object        
 6   repliedAt      9168 non-null   datetime64[ns]
 7   appVersion     14430 non-null  object        
 8   sortOrder      16787 non-null  object        
 9   appId          16787 non-null  object        
dtypes: datetime64[ns](2), int16(1), object(7)
memory usage: 1.2+ MB
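
As an alternative to `dayfirst=True`, the format can be stated explicitly so that an unexpected value is not parsed in a different layout. This is a sketch only: the format string is an assumption inferred from the sample rows shown earlier (for example `25-02-2024 16:01`), and `errors="coerce"` is an optional choice that turns unparseable values into `NaT`.

```python
import pandas as pd

# Two timestamps copied from the sample output above, plus a missing value.
raw = pd.Series(["25-02-2024 16:01", "21-01-2024 20:06", None])

# An explicit day-month-year format; unparseable entries become NaT.
parsed = pd.to_datetime(raw, format="%d-%m-%Y %H:%M", errors="coerce")
print(parsed)
```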


Moreover, checking the possible values of the **_sortOrder_** & **_appId_** columns shows that each has only a limited set of values, as shown below. So these are good candidates for the **category** datatype.

In [9]:
reviewData['sortOrder'].unique()

array(['most_relevant', 'newest'], dtype=object)

In [10]:
reviewData['appId'].unique()

array(['com.anydo', 'com.todoist', 'com.ticktick.task',
       'com.habitrpg.android.habitica', 'cc.forestapp',
       'com.oristats.habitbull', 'com.levor.liferpgtasks', 'com.habitnow',
       'com.microsoft.todos', 'prox.lab.calclock',
       'com.gmail.jmartindev.timetune', 'com.artfulagenda.app',
       'com.tasks.android', 'com.appgenix.bizcal', 'com.appxy.planner'],
      dtype=object)

In [11]:
# Converting their datatypes
categorical_colms = {'sortOrder':'category', 'appId':'category'}
reviewData = reviewData.astype(categorical_colms)
reviewData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16787 entries, 0 to 16786
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   reviewId       16787 non-null  object        
 1   userName       16787 non-null  object        
 2   content        16786 non-null  object        
 3   thumbsUpCount  16787 non-null  int16         
 4   at             16787 non-null  datetime64[ns]
 5   replyContent   9168 non-null   object        
 6   repliedAt      9168 non-null   datetime64[ns]
 7   appVersion     14430 non-null  object        
 8   sortOrder      16787 non-null  category      
 9   appId          16787 non-null  category      
dtypes: category(2), datetime64[ns](2), int16(1), object(5)
memory usage: 984.5+ KB
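
The saving from the **category** datatype comes from storing each distinct value once and keeping only small integer codes per row. A minimal sketch on a made-up column (three of the app ids listed above, repeated; the row count is arbitrary):

```python
import pandas as pd

# A few app ids from the unique() output above, repeated to mimic many rows.
app_ids = ["com.anydo", "com.todoist", "com.habitnow"]
as_object = pd.Series(app_ids * 1000, dtype="object")
as_category = as_object.astype("category")

# deep=True counts the actual string contents, not just the pointers.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(object_bytes, category_bytes)
```

Note that `info()` reports shallow memory by default (hence the `+` in its output); `memory_usage(deep=True)` gives a fuller picture for object columns.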


Finally, all the other columns are suited to the **string** datatype, which will let us perform string operations in any future analysis.

In [12]:
string_colms = {'reviewId': 'string',
                'userName': 'string',
                'content': 'string',
                'replyContent': 'string',
                'appVersion': 'string'}
reviewData = reviewData.astype(string_colms)
reviewData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16787 entries, 0 to 16786
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   reviewId       16787 non-null  string        
 1   userName       16787 non-null  string        
 2   content        16786 non-null  string        
 3   thumbsUpCount  16787 non-null  int16         
 4   at             16787 non-null  datetime64[ns]
 5   replyContent   9168 non-null   string        
 6   repliedAt      9168 non-null   datetime64[ns]
 7   appVersion     14430 non-null  string        
 8   sortOrder      16787 non-null  category      
 9   appId          16787 non-null  category      
dtypes: category(2), datetime64[ns](2), int16(1), string(5)
memory usage: 984.5 KB


### Question:
As we can see, there are many null values. What would be the best way to handle them?
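
One possible starting point, offered as a sketch rather than a definitive answer: count the nulls per column, then decide per column whether a null means "missing" or simply "not applicable". In the toy frame below (made-up values), a null **replyContent** just means no reply was posted, so it is kept and summarised in a hypothetical `hasReply` flag, while a null **content** leaves nothing to analyse and the row is dropped.

```python
import pandas as pd

# Toy stand-in for reviewData; values are made up for illustration.
toy = pd.DataFrame({
    "content": ["great app", None, "crashes often"],
    "replyContent": ["Thanks!", None, None],
})

# Number of missing values in each column.
null_counts = toy.isna().sum()
print(null_counts)

# hasReply is a hypothetical helper column, not part of the dataset.
toy["hasReply"] = toy["replyContent"].notna()

# Rows without review text cannot be used for sentiment analysis.
cleaned = toy.dropna(subset=["content"])
print(cleaned)
```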

Let's begin with the other pre-processing steps assigned for the project.

In [13]:
reviewData.sample(5)

Unnamed: 0,reviewId,userName,content,thumbsUpCount,at,replyContent,repliedAt,appVersion,sortOrder,appId
1223,680517f1-0117-4b44-b025-b25f6af76077,Mr. Yusuf,The app is too complicated and has too many fe...,5,2023-03-18 07:17:00,Hi Yusuf! We are sorry to hear this 😞 We would...,2023-03-21 04:05:00,v10770,most_relevant,com.todoist
13096,f192634e-ce18-4101-8512-b96c95fa6d5c,Carl and Elly,I have been using Artful for a little over 2 y...,0,2024-03-14 18:30:00,Thank you so much Carl and Elly for your revie...,2024-03-16 21:31:00,,newest,com.artfulagenda.app
10277,078ab3a7-0028-4911-a25b-6afd0f080957,Brettany Daniels,Sounds like a wonderful idea for time manageme...,0,2020-07-13 19:40:00,"Hi, Make sure that all the calendars you need ...",2020-07-13 22:24:00,,most_relevant,prox.lab.calclock
16739,b22618f6-cbeb-4490-8a52-3b21395ccc51,Vanessa Castro,Planner Pro is simple and efficient.,1,2023-11-05 14:54:00,,NaT,6.2,newest,com.appxy.planner
11359,d5b36622-0757-4f7a-810d-92582509e07c,Carin Basson,Wow. That was a quick fix! The button toggle d...,0,2024-02-18 13:45:00,Hello. We'll fix it. Alternatively: Go to the ...,2024-02-18 13:42:00,5.27.2,newest,prox.lab.calclock


In [14]:
reviewData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16787 entries, 0 to 16786
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   reviewId       16787 non-null  string        
 1   userName       16787 non-null  string        
 2   content        16786 non-null  string        
 3   thumbsUpCount  16787 non-null  int16         
 4   at             16787 non-null  datetime64[ns]
 5   replyContent   9168 non-null   string        
 6   repliedAt      9168 non-null   datetime64[ns]
 7   appVersion     14430 non-null  string        
 8   sortOrder      16787 non-null  category      
 9   appId          16787 non-null  category      
dtypes: category(2), datetime64[ns](2), int16(1), string(5)
memory usage: 984.5 KB


In [15]:
# So there is one row where content is null, let's remove that row entirely

reviewData = reviewData.dropna(subset=['content'])
reviewData.shape

(16786, 10)

In [16]:
# Convert to lower case
reviewData['content'] = reviewData['content'].str.lower()

reviewData.sample(5)

Unnamed: 0,reviewId,userName,content,thumbsUpCount,at,replyContent,repliedAt,appVersion,sortOrder,appId
299,1a1e2773-e082-4df8-8c45-06074788f6ff,Fernando Hidalgo,ux is poor like most of these app. the growing...,32,2020-11-10 17:22:00,We work hard to make sure that Any.do's UX is ...,2020-11-11 10:09:00,5.6.0.8,most_relevant,com.anydo
7878,a6e1f447-458a-4b8a-854d-8d946289292b,Ashutosh Vajpayee,i wish i could give zero stars..what nonsense ...,6,2022-01-17 03:12:00,,NaT,1.9.1,most_relevant,com.habitnow
6751,a8d2e75a-252f-4e6c-b1e8-ccf1a5835fe9,Diego Gonzalo Jara Llanos,not totally friendly environment,0,2020-05-31 13:11:00,,NaT,1.5.11,newest,com.oristats.habitbull
11608,adae6613-3feb-4eb9-a61b-84012e527c02,A Google user,i was recommended this app by a friend who lov...,2,2018-10-02 01:44:00,Hi Makiya. We're sorry you feel that way. The ...,2018-10-02 08:56:00,2.5,most_relevant,com.gmail.jmartindev.timetune
13720,aec04525-4d11-493d-94bb-1b0f92372d09,A Google user,i couldn't find a way to link to google tasks....,1,2019-08-08 00:39:00,"Hi Stan, I do not claim this feature on the Pl...",2019-08-08 07:43:00,1.3,most_relevant,com.tasks.android


In [17]:
# Removing links using regex match
reviewData['content'] = reviewData['content'].apply(lambda x: re.sub(r'http\S+', '', x))

In [18]:
# Replacing new lines with a space using regex, so that words on adjacent lines are not joined together
reviewData['content'] = reviewData['content'].replace(r'\n', ' ', regex=True)

In [19]:
# removing every word that contains a digit (alpha-numeric words as well as plain numbers)
reviewData['content'] = reviewData['content'].apply(lambda x: ' '.join([word for word in x.split() if not any(char.isdigit() for char in word)]))

In [20]:
# removing extra spaces from end as well as from in between
reviewData['content'] = reviewData['content'].apply(lambda x: ' '.join(x.strip().split()))

In [21]:
# removing any special characters using regex (raw string, so that \s is not treated as an invalid escape)
reviewData['content'] = reviewData['content'].replace(r'[^a-zA-Z0-9\s]', '', regex=True)

In [22]:
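# removing stop words
# note: apostrophes were stripped in the previous step, so contractions such as "don't" are now "dont" and will not match NLTK's stop-word list
stop_words = set(stopwords.words('english'))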
reviewData['content'] = reviewData['content'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))
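
The cleaning steps up to this point can also be collected into a single function, which makes them easier to test on one string at a time. The sketch below roughly mirrors those steps using only `re`; `clean_text` and `STOP_WORDS` are hypothetical names, the stop-word set is a tiny illustrative subset rather than NLTK's list, and newlines are replaced by a space so that adjacent words stay separate.

```python
import re

# Illustrative subset only; the notebook uses NLTK's English stop-word list.
STOP_WORDS = {"the", "is", "a", "and", "to", "it", "i"}

def clean_text(text: str) -> str:
    """Lower-case, strip links, digit-bearing words, punctuation and stop words."""
    text = text.lower()
    text = re.sub(r"http\S+", "", text)   # drop links
    text = text.replace("\n", " ")        # newlines become spaces
    # Drop every word that contains at least one digit.
    words = [w for w in text.split() if not any(ch.isdigit() for ch in w)]
    text = " ".join(words)
    text = re.sub(r"[^a-z\s]", "", text)  # drop special characters
    # split() also collapses any repeated whitespace.
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

cleaned = clean_text("The app is GREAT!\nSee https://example.com and v2.0 rocks")
print(cleaned)  # app great see rocks
```

With a function like this, the column could be processed in one pass, for example `reviewData['content'].apply(clean_text)`.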


In [23]:
# stemming
stemmer = PorterStemmer()
reviewData['content'] = reviewData['content'].apply(lambda x: ' '.join([stemmer.stem(word) for word in x.split()]))

In [24]:
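# lemmatization
# note: the text was already stemmed above, so many tokens (e.g. "simpl") are no longer dictionary words and pass through unchanged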
lemmatizer = WordNetLemmatizer()
reviewData['content'] = reviewData['content'].apply(lambda x: ' '.join([lemmatizer.lemmatize(word) for word in x.split()]))

In [25]:
reviewData.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16786 entries, 0 to 16786
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   reviewId       16786 non-null  string        
 1   userName       16786 non-null  string        
 2   content        16786 non-null  object        
 3   thumbsUpCount  16786 non-null  int16         
 4   at             16786 non-null  datetime64[ns]
 5   replyContent   9167 non-null   string        
 6   repliedAt      9167 non-null   datetime64[ns]
 7   appVersion     14429 non-null  string        
 8   sortOrder      16786 non-null  category      
 9   appId          16786 non-null  category      
dtypes: category(2), datetime64[ns](2), int16(1), object(1), string(4)
memory usage: 1.1+ MB


In [26]:
reviewData.sample(5)

Unnamed: 0,reviewId,userName,content,thumbsUpCount,at,replyContent,repliedAt,appVersion,sortOrder,appId
10143,730c34f6-13be-4388-a321-09dbc68aaf6f,Zaid,simpl quick interfac aint smth youd expect mic...,0,2024-03-31 16:45:00,,NaT,2.114.690.01,newest,com.microsoft.todos
5353,8655713a-b37f-4677-8e15-ab692fad0368,Annu Suresh,look app like forest improv focu avoid distrac...,0,2020-05-07 10:13:00,"Hello, We require certain permission just to e...",2020-05-08 04:59:00,4.16.3,most_relevant,cc.forestapp
7310,ef43c63d-25ec-4756-abd0-45e2ffb17610,RemixV4,frustrat limit,1,2023-07-27 02:01:00,,NaT,22.6.0,newest,com.levor.liferpgtasks
11840,d0d5623e-129d-43d5-86c6-c5ecd13d880b,Uzoamaka Nnabude,download app need time tabl schedul notifi tim...,0,2022-10-18 15:16:00,Hi. The problem happens because Tecno devices ...,2022-10-18 14:20:00,4.5,most_relevant,com.gmail.jmartindev.timetune
188,7234afd1-0a66-4326-b841-446fd44839e4,Abhijeet Jayaswal,unabl assign task anyon whatsapp,0,2023-12-19 17:34:00,If you are a workspace member you should be ab...,2023-12-20 10:09:00,,newest,com.anydo


### Question:
It is seen that the size of the dataframe increased from 984.5 KB to 1.1+ MB after performing the above techniques. Logically, we have removed a lot of text from content, so the size should have either stayed the same or reduced. What is the reason for the increase?
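
One likely contributor, judging from the outputs above: the last `info()` reports `Index: 16786 entries` where earlier ones reported `RangeIndex`. A `RangeIndex` stores only its start, stop and step, but after `dropna` removes a row from the middle, the index becomes an explicit array of 64-bit integers. At 8 bytes per row, 16,786 rows come to about 131 KB, which is close to the observed difference. In addition, `info()` counts only pointers for text columns by default, so removing characters from **content** does not show up in the reported figure, and **content** is listed as `object` after the `apply` calls, which is where the `+` comes from. The sketch below reproduces the index effect on a made-up frame.

```python
import pandas as pd

# Toy stand-in; the missing value sits in the middle of the frame.
toy = pd.DataFrame({"content": ["good", None, "bad", "fine"]})
print(type(toy.index).__name__)       # RangeIndex

# Dropping a middle row forces pandas to store the surviving labels explicitly.
cleaned = toy.dropna(subset=["content"])
print(type(cleaned.index).__name__, cleaned.index.nbytes)

# reset_index(drop=True) restores a compact RangeIndex.
restored = cleaned.reset_index(drop=True)
print(type(restored.index).__name__)
```

Calling `memory_usage(deep=True)` on the real dataframe before and after cleaning would show whether the text itself actually shrank.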