In [113]:
!pip install pandas --quiet

#### Task 1: Load and Understand the Dataset 
- Load the dataset using pandas 
- Display the first 5 and last 5 rows 
- Check dataset shape, column names, and data types 
- Explain what each column represents

In [114]:
import pandas as pd

In [115]:
# Loading dataset using pandas
df = pd.read_csv("Dataset.csv")

In [116]:
# display first 5 rows
df_one = df.copy()
df_one.head()

,text,label
0,I rented I AM CURIOUS-YELLOW from my video sto...,0
1,"""I Am Curious: Yellow"" is a risible and preten...",0
2,If only to avoid making this type of film in t...,0
3,This film was probably inspired by Godard's Ma...,0
4,"Oh, brother...after hearing about this ridicul...",0


In [117]:
# display last 5 rows
df_one.tail()

,text,label
24995,A hit at the time but now better categorised a...,1
24996,I love this movie like no other. Another time ...,1
24997,This film and it's sequel Barry Mckenzie holds...,1
24998,'The Adventures Of Barry McKenzie' started lif...,1
24999,The story centers around Barry McKenzie who mu...,1


In [118]:
# Check dataset shape
df_one.shape

(25000, 2)

In [119]:
# Check dataset column names
df_one.columns

Index(['text', 'label'], dtype='str')

In [120]:
# Check data types of various columns
df_one.info()

<class 'pandas.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   text    25000 non-null  str  
 1   label   25000 non-null  int64
dtypes: int64(1), str(1)
memory usage: 390.8 KB


**Explain what each column represents**

text: A string column (dtype `str`) that holds the full text of each review. All 25,000 values are non-null.

label: An integer column (dtype `int64`), also with 25,000 non-null values. It is a binary categorical variable: 0 marks a negative review and 1 marks a positive review.
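
The column descriptions can be checked directly rather than read off `info()` alone. The following is a minimal sketch on a made-up three-row frame, not the real dataset; the `toy` frame, the `label_names` mapping, and the `sentiment` column are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the real data: same two columns, three rows
toy = pd.DataFrame(
    {"text": ["great film", "dull plot", "loved it"], "label": [1, 0, 1]}
)

# Count missing values per column (info() reports the non-null counts instead)
missing = toy.isna().sum()

# List the distinct label values to confirm the column is binary
unique_labels = sorted(toy["label"].unique().tolist())

# Assumed mapping: 0 = negative, 1 = positive
label_names = {0: "negative", 1: "positive"}
toy["sentiment"] = toy["label"].map(label_names)

print(missing)
print(unique_labels)
print(toy)
```

Running the same three checks on `df_one` would show whether the real `label` column contains only 0 and 1.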


#### Task 2: Mini Exploratory Data Analysis (EDA) 
- Count the number of positive and negative reviews 
- Create two new columns: 
  - text_length: number of characters 
  - word_count: number of words 
- Find the review with the maximum word count 

In [123]:
df_two = df.copy()

In [126]:
# Count the number of positive and negative reviews
label_counts = df_two["label"].value_counts()  # compute once, reuse below
positive = label_counts[1]
negative = label_counts[0]
print(f"Total Positive reviews: {positive}")
print(f"Total Negative reviews: {negative}")

Total Positive reviews: 12500
Total Negative reviews: 12500


In [None]:
# Create two new columns: 
#  - text_length: number of characters 
#  - word_count: number of words 
df_two["text_length"] = df_two["text"].apply(len)
df_two["word_count"] = df_two["text"].apply(lambda txt : len(txt.split()))
df_two[["text", "text_length", "word_count"]].head()


,text,text_length,word_count
0,I rented I AM CURIOUS-YELLOW from my video sto...,1640,288
1,"""I Am Curious: Yellow"" is a risible and preten...",1294,214
2,If only to avoid making this type of film in t...,528,93
3,This film was probably inspired by Godard's Ma...,706,118
4,"Oh, brother...after hearing about this ridicul...",1814,311
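
The two columns above are built with `apply`. As an alternative, pandas' `.str` accessor can compute the same values without a Python-level lambda. This is a small sketch on a made-up two-row frame (the `toy` frame and its strings are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical two-row frame; the second string has repeated spaces on purpose
toy = pd.DataFrame(
    {"text": ["a short review", "one  more   example review here"]}
)

# Number of characters, including spaces
toy["text_length"] = toy["text"].str.len()

# Number of whitespace-separated tokens; split() with no argument
# treats runs of whitespace as a single separator
toy["word_count"] = toy["text"].str.split().str.len()

print(toy)
```

Because `split()` is called without a separator, repeated spaces do not inflate the word count, which matches the behavior of `len(txt.split())` in the cell above.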


In [None]:
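# Find the review with the maximum word count
# idxmax() returns the row label of the longest review, not the count itself
max_idx = df_two["word_count"].idxmax()
longest_review = df_two.loc[max_idx, "text"]
max_count = df_two.loc[max_idx, "word_count"]
print(f"Longest review: {longest_review}")
print(f"Word count of this review: {max_count}")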

Longest review: Match 1: Tag Team Table Match Bubba Ray and Spike Dudley vs Eddie Guerrero and Chris Benoit Bubba Ray and Spike Dudley started things off with a Tag Team Table Match against Eddie Guerrero and Chris Benoit. According to the rules of the match, both opponents have to go through tables in order to get the win. Benoit and Guerrero heated up early on by taking turns hammering first Spike and then Bubba Ray. A German suplex by Benoit to Bubba took the wind out of the Dudley brother. Spike tried to help his brother, but the referee restrained him while Benoit and Guerrero ganged up on him in the corner. With Benoit stomping away on Bubba, Guerrero set up a table outside. Spike dashed into the ring and somersaulted over the top rope onto Guerrero on the outside! After recovering and taking care of Spike, Guerrero slipped a table into the ring and helped the Wolverine set it up. The tandem then set up for a double superplex from the middle rope which would have put Bubba throug

#### Task 3: Text Cleaning Using Regex 
Apply the following preprocessing steps using regex: 
- Remove URLs 
- Remove numbers 
- Remove extra spaces

In [None]:
import re

In [132]:
df_three = df_two.copy()

In [148]:
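def clean_text(text):
    # Remove URLs (http://, https://, or starting with www.)
    text = re.sub(r"https?://\S+|www\.\S+", "", text)
    # Remove numbers (any run of digits)
    text = re.sub(r"\d+", "", text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r"\s+", " ", text)

    return text.strip()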

df_three["cleaned_text"] = df_three["text"].apply(clean_text)
df_three[["text", "cleaned_text"]].head()

,text,cleaned_text
0,I rented I AM CURIOUS-YELLOW from my video sto...,I rented I AM CURIOUS-YELLOW from my video sto...
1,"""I Am Curious: Yellow"" is a risible and preten...","""I Am Curious: Yellow"" is a risible and preten..."
2,If only to avoid making this type of film in t...,If only to avoid making this type of film in t...
3,This film was probably inspired by Godard's Ma...,This film was probably inspired by Godard's Ma...
4,"Oh, brother...after hearing about this ridicul...","Oh, brother...after hearing about this ridicul..."
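
The truncated previews above show no URLs, digits, or repeated spaces, so `text` and `cleaned_text` look identical there. A minimal sketch on a made-up string makes each step visible. The sample sentence and the `example.com` URL are illustrative assumptions, not taken from the dataset, and the function is repeated here only so the snippet runs on its own.

```python
import re

def clean_text(text):
    # Same three substitutions as the notebook's clean_text
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # remove URLs
    text = re.sub(r"\d+", "", text)                    # remove numbers
    text = re.sub(r"\s+", " ", text)                   # collapse whitespace
    return text.strip()

# Hypothetical input containing a URL, digits, and extra spaces
sample = "Visit https://example.com  now, rated 10/10   stars!"
cleaned = clean_text(sample)
print(cleaned)  # Visit now, rated / stars!
```

Note that removing digits leaves the surrounding punctuation behind (the stray `/` from `10/10`), and that `\S+` in the URL pattern also consumes any punctuation attached to the end of a URL.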
