### Task 1: Load and Read Text Files with Correct Encoding

In [4]:
import chardet

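# Function to detect encoding
def detect_encoding(file_path):
    """Guess the encoding of a file from its raw bytes using chardet."""
    with open(file_path, 'rb') as f:
        raw_data = f.read()
    # chardet reports None when it cannot make a guess (e.g. an empty file),
    # so fall back to UTF-8 in that case
    return chardet.detect(raw_data)['encoding'] or 'utf-8'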

# File paths
file_paths = ["reviews1.txt", "reviews2.txt", "review1.txt"]

# Load reviews into a list
reviews = []
for file_path in file_paths:
    encoding = detect_encoding(file_path)
    with open(file_path, "r", encoding=encoding) as f:
        # Keep one review per non-empty line, without surrounding whitespace
        reviews.extend(line.strip() for line in f if line.strip())

# Print reviews to verify
for i, review in enumerate(reviews):
    print(f"Review {i+1}:\n{review}\n{'-'*50}")



Review 1:
Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it's better than you might think with some good cinematography by future great Vilmos Zsigmond. Future stars Sally Kirkland and Frederic Forrest can be seen briefly.
--------------------------------------------------
Review 2:
Bromwell High is nothing short of brilliant. Expertly scripted and perfectly delivered, this searing parody of a students and teachers at a South London Public School leaves you literally rolling with laughter. It's vulgar, provocative, witty and sharp. The characters are 
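
chardet returns a statistical guess, and that guess can be wrong or missing for very short files. A minimal sketch of a fallback, assuming the raw bytes have already been read: try a fixed list of encodings in order and keep the first one that decodes. The helper name `decode_with_fallback` and the candidate list are illustrative assumptions, not part of the notebook.

```python
def decode_with_fallback(raw_data, candidates=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Hypothetical helper: the candidate order is an assumption and should
    reflect where the files actually come from.
    """
    for encoding in candidates:
        try:
            return raw_data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 accepts any byte sequence, so this is only reachable
    # when the candidate list does not include it
    raise ValueError("none of the candidate encodings could decode the data")


# UTF-8 bytes decode with the first candidate
print(decode_with_fallback("café".encode("utf-8")))   # ('café', 'utf-8')
# A lone 0xE9 byte is invalid UTF-8, so the next candidate is used
print(decode_with_fallback(b"caf\xe9"))               # ('café', 'cp1252')
```

Strict UTF-8 decoding fails loudly on most non-UTF-8 input, which is why it is a reasonable first candidate; the single-byte encodings that follow rarely raise, so their order matters.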

### Task 2: Text Preprocessing (Tokenization, Lowercasing, Removing Punctuation)

In [5]:
import re
from collections import Counter

# Function to preprocess text
def preprocess_text(text):
    text = text.lower()  # Convert to lowercase
    # Keep only ASCII letters and whitespace. This drops punctuation and digits,
    # and also apostrophes and accented letters, so "it's" becomes "its".
    text = re.sub(r'[^a-z\s]', '', text)
    words = text.split()  # Tokenize on whitespace
    return words

# Apply preprocessing
preprocessed_reviews = [preprocess_text(review) for review in reviews]

# Flatten the list and count word occurrences
word_counts = Counter([word for review in preprocessed_reviews for word in review])

# Display the 10 most common words
print("Top 10 most common words:", word_counts.most_common(10))


Top 10 most common words: [('the', 45), ('of', 36), ('a', 30), ('is', 28), ('and', 27), ('i', 17), ('it', 14), ('to', 13), ('that', 11), ('in', 10)]
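
The token `br` shows up among the frequent words in Task 4. It probably comes from HTML line breaks (`<br />`) in the raw reviews, which is an assumption since the raw files are not shown in full. The sketch below shows how the regex above treats such tags and one possible alternative that removes tags first. The function `preprocess_text_html_aware` and the sample string are hypothetical.

```python
import re


def preprocess_text(text):
    """Same steps as the notebook's preprocess_text."""
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    return text.split()


def preprocess_text_html_aware(text):
    """Hypothetical variant: replace HTML tags with a space before cleaning."""
    text = text.lower()
    text = re.sub(r'<[^>]+>', ' ', text)   # e.g. "<br />" becomes " "
    text = re.sub(r'[^a-z\s]', '', text)
    return text.split()


sample = "It's great.<br /><br />Really!"   # made-up example input
print(preprocess_text(sample))              # ['its', 'greatbr', 'br', 'really']
print(preprocess_text_html_aware(sample))   # ['its', 'great', 'really']
```

Without the tag-stripping step, the angle brackets and slash are deleted but the letters `br` survive, either as a separate token or fused onto a neighboring word.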


### Task 3: Stemming and Lemmatization

In [6]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet
import nltk

nltk.download('wordnet')
nltk.download('omw-1.4')

# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Function to apply stemming
def apply_stemming(words):
    return [stemmer.stem(word) for word in words]

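# Function to apply lemmatization
# Every token is lemmatized as a verb (pos=wordnet.VERB), so verb forms are
# reduced ('has' -> 'have', 'is' -> 'be') while plural nouns such as
# 'feelings' are left unchanged.
def apply_lemmatization(words):
    return [lemmatizer.lemmatize(word, wordnet.VERB) for word in words]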

# Apply stemming and lemmatization
stemmed_reviews = [apply_stemming(review) for review in preprocessed_reviews]
lemmatized_reviews = [apply_lemmatization(review) for review in preprocessed_reviews]

# Compare first review's original, stemmed, and lemmatized versions
print("Original:", preprocessed_reviews[0][:20])
print("Stemmed:", stemmed_reviews[0][:20])
print("Lemmatized:", lemmatized_reviews[0][:20])


[nltk_data] Downloading package wordnet to
[nltk_data]     /home/rajubuntu/nltk_data...
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /home/rajubuntu/nltk_data...


Original: ['story', 'of', 'a', 'man', 'who', 'has', 'unnatural', 'feelings', 'for', 'a', 'pig', 'starts', 'out', 'with', 'a', 'opening', 'scene', 'that', 'is', 'a']
Stemmed: ['stori', 'of', 'a', 'man', 'who', 'ha', 'unnatur', 'feel', 'for', 'a', 'pig', 'start', 'out', 'with', 'a', 'open', 'scene', 'that', 'is', 'a']
Lemmatized: ['story', 'of', 'a', 'man', 'who', 'have', 'unnatural', 'feelings', 'for', 'a', 'pig', 'start', 'out', 'with', 'a', 'open', 'scene', 'that', 'be', 'a']
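
To make the differences in the output above easier to see, here is a small sketch that lists only the positions where a token changed. It reuses the three printed lists verbatim; the helper name `changed_tokens` is an assumption for illustration.

```python
# The three lists printed above, copied verbatim
original = ['story', 'of', 'a', 'man', 'who', 'has', 'unnatural', 'feelings',
            'for', 'a', 'pig', 'starts', 'out', 'with', 'a', 'opening',
            'scene', 'that', 'is', 'a']
stemmed = ['stori', 'of', 'a', 'man', 'who', 'ha', 'unnatur', 'feel',
           'for', 'a', 'pig', 'start', 'out', 'with', 'a', 'open',
           'scene', 'that', 'is', 'a']
lemmatized = ['story', 'of', 'a', 'man', 'who', 'have', 'unnatural', 'feelings',
              'for', 'a', 'pig', 'start', 'out', 'with', 'a', 'open',
              'scene', 'that', 'be', 'a']


def changed_tokens(before, after):
    """Return (before, after) pairs for positions where the token changed."""
    return [(b, a) for b, a in zip(before, after) if b != a]


print("Changed by stemming:", changed_tokens(original, stemmed))
print("Changed by lemmatization:", changed_tokens(original, lemmatized))
```

Stemming changes six tokens and produces truncated forms that are not dictionary words (`stori`, `ha`, `unnatur`). Lemmatization changes four tokens and always returns real words, but because every token is treated as a verb here, the plural noun `feelings` is left as is.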


### Task 4: Stopword Removal

In [7]:
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Function to remove stopwords
def remove_stopwords(words):
    return [word for word in words if word not in stop_words]

# Apply stopword removal
filtered_reviews = [remove_stopwords(review) for review in lemmatized_reviews]

# Count words after stopword removal, for comparison with the Task 2 counts
filtered_word_counts = Counter([word for review in filtered_reviews for word in review])
print("Top 10 words after stopword removal:", filtered_word_counts.most_common(10))


Top 10 words after stopword removal: [('bromwell', 7), ('high', 7), ('teachers', 6), ('every', 6), ('film', 6), ('time', 5), ('school', 5), ('like', 5), ('br', 5), ('good', 4)]


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/rajubuntu/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
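
For a direct before/after comparison, the two top-10 lists printed in Task 2 and Task 4 can be placed side by side. This is only a sketch: the lists are copied from the outputs above, and the helper name `side_by_side` is hypothetical.

```python
# Top-10 lists copied from the Task 2 and Task 4 outputs
top_before = [('the', 45), ('of', 36), ('a', 30), ('is', 28), ('and', 27),
              ('i', 17), ('it', 14), ('to', 13), ('that', 11), ('in', 10)]
top_after = [('bromwell', 7), ('high', 7), ('teachers', 6), ('every', 6),
             ('film', 6), ('time', 5), ('school', 5), ('like', 5),
             ('br', 5), ('good', 4)]


def side_by_side(before, after):
    """Pair up two ranked (word, count) lists by rank."""
    return [(rank, b_word, b_count, a_word, a_count)
            for rank, ((b_word, b_count), (a_word, a_count))
            in enumerate(zip(before, after), start=1)]


print(f"{'#':>2}  {'before':<10}{'n':>3}    {'after':<10}{'n':>3}")
for rank, b_word, b_count, a_word, a_count in side_by_side(top_before, top_after):
    print(f"{rank:>2}  {b_word:<10}{b_count:>3}    {a_word:<10}{a_count:>3}")
```

Before removal, all ten entries are function words; afterwards, content words such as `bromwell`, `teachers`, and `film` surface, along with the stray `br` token.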
