In [2]:
import time
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql import Row

# Initialize Spark session
spark = SparkSession.builder.appName("HateTweetDetection").getOrCreate()

# Start timer
start_time = time.time()

# Load dataset
df = spark.read.csv("twitter.csv", header=True, inferSchema=True)

# Select necessary columns
df = df.select(col("tweet"), col("label"))  # label: 0 = not hate, 1 = hate

# Split into training and testing sets
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)

# Create pipeline stages
tokenizer = Tokenizer(inputCol="tweet", outputCol="words")
remover = StopWordsRemover(inputCol="words", outputCol="filtered_words")
hashing_tf = HashingTF(inputCol="filtered_words", outputCol="raw_features", numFeatures=10000)
idf = IDF(inputCol="raw_features", outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

# Assemble pipeline
pipeline = Pipeline(stages=[tokenizer, remover, hashing_tf, idf, lr])

# Fit model
model = pipeline.fit(train_df)

# Evaluate on test data
predictions = model.transform(test_df)
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print(f"\n✅ Test Accuracy: {accuracy:.2f}")

# Stop timer (the elapsed time covers loading, training, and evaluation, not training alone)
end_time = time.time()
print(f"⏱️ Model Training Time: {end_time - start_time:.2f} seconds\n")

# Loop for user input
while True:
    print("\nOptions:")
    print("1. Enter tweet for prediction")
    print("2. Quit")
    choice = input("Enter your choice (1 or 2): ")

    if choice == '2':
        print("Exiting program. Goodbye!")
        break
    elif choice == '1':
        user_tweet = input("Type your tweet: ")
        user_df = spark.createDataFrame([Row(tweet=user_tweet)])
        prediction = model.transform(user_df).select("tweet", "prediction").collect()[0]

        print(f"\nTweet: {prediction['tweet']}")
        print(f"Prediction: {'Hate Tweet' if prediction['prediction'] == 1.0 else 'Not Hate Tweet'}")
    else:
        print("Invalid input. Please enter 1 or 2.")

# Stop Spark session
spark.stop()



✅ Test Accuracy: 0.89
⏱️ Model Training Time: 15.29 seconds


Options:
1. Enter tweet for prediction
2. Quit
Enter your choice (1 or 2): 1
Type your tweet: @user #cnn calls #michigan middle school 'build the wall' chant '' #tcot  

Tweet: @user #cnn calls #michigan middle school 'build the wall' chant '' #tcot  
Prediction: Hate Tweet

Options:
1. Enter tweet for prediction
2. Quit
Enter your choice (1 or 2):  factsguide: society now    #motivation
Invalid input. Please enter 1 or 2.

Options:
1. Enter tweet for prediction
2. Quit
Enter your choice (1 or 2): 1
Type your tweet:  factsguide: society now    #motivation

Tweet:  factsguide: society now    #motivation
Prediction: Not Hate Tweet

Options:
1. Enter tweet for prediction
2. Quit
Enter your choice (1 or 2): 2
Exiting program. Goodbye!


In [None]:
''' Here’s a **line-by-line explanation** of your PySpark code for **hate tweet detection**, along with relevant **theory** and a list of **external viva questions with answers** that you can expect.

---

### ✅ Full Code Explanation with Theory

```python
import time
```
- Imports the `time` module to measure execution time.
- **Theory**: Wall-clock timing shows how long the run takes end to end. It says nothing about prediction quality, which is measured separately on the test set.

---

```python
from pyspark.sql import SparkSession
```
- Imports `SparkSession`, the entry point for DataFrame and SQL functionality in PySpark.
- **Theory**: A SparkSession connects the program to the Spark runtime, which distributes the computation across executors.

---

```python
from pyspark.sql.functions import col
```
- Imports the `col` function, which builds a column expression from a column name for use in DataFrame transformations.

---

```python
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF
```
- Imports NLP preprocessing tools:
  - `Tokenizer`: Lowercases the text and splits it on whitespace into words (tokens).
  - `StopWordsRemover`: Removes common words like "the", "is", "and".
  - `HashingTF`: Hashes each word to an index in a fixed-size vector and counts occurrences (term frequency).
  - `IDF`: Rescales term frequencies so that words appearing in many tweets get a lower weight.

---

```python
from pyspark.ml.classification import LogisticRegression
```
- Imports a simple but powerful classifier that works well for binary classification tasks (hate or not hate).

---

```python
from pyspark.ml import Pipeline
```
- Pipeline allows chaining of preprocessing steps and model training into one reusable workflow.

---

```python
spark = SparkSession.builder.appName("HateTweetDetection").getOrCreate()
```
- Initializes the Spark environment with the name "HateTweetDetection".

---

```python
start_time = time.time()
```
- Starts recording time to calculate how long the code takes to run.

---

```python
df = spark.read.csv("twitter.csv", header=True, inferSchema=True)
```
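- Loads the dataset `twitter.csv` into a DataFrame. `header=True` treats the first line as column names, and `inferSchema=True` makes Spark scan the data to infer column types.
- **Theory**: Without `inferSchema`, every column is read as a string, so the numeric `label` column would need an explicit cast before training.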

---

```python
df = df.select(col("tweet"), col("label"))
```
- Selects only the `tweet` and `label` columns. The label should be 0 (non-hate) or 1 (hate).

---

```python
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)
```
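- Randomly splits the dataset into roughly 80% training and 20% testing; `seed=42` makes the split reproducible.
- **Theory**: Scoring on held-out data gives an honest estimate of generalization. The split does not prevent overfitting by itself; it exposes it, because an overfit model scores well on training data and poorly on the test set.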
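
The idea behind a seeded random split can be sketched in plain Python. This is only an illustration, not Spark's implementation: `random_split` is a hypothetical helper, and it assumes each row is assigned independently, which is why the resulting sizes are approximately, not exactly, 80/20.

```python
import random

def random_split(rows, train_fraction=0.8, seed=42):
    # Hypothetical helper: draw one random number per row and route the row
    # to train or test. Sizes are approximate because rows are independent.
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(1000))
train_rows, test_rows = random_split(rows)
print(len(train_rows), len(test_rows))
```

Running it twice with the same seed gives the same split, which is the point of fixing `seed`.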

---

```python
tokenizer = Tokenizer(inputCol="tweet", outputCol="words")
```
- Tokenizes the tweets: lowercases each tweet and splits it on whitespace into a list of words.
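
A rough plain-Python equivalent of this step, for intuition only. The `tokenize` function is a hypothetical stand-in; Python's `split()` collapses runs of whitespace, so edge cases may differ from Spark's `Tokenizer`.

```python
def tokenize(text):
    # Lowercase, then split on whitespace. Punctuation, "@" and "#" are kept
    # as part of the token.
    return text.lower().split()

tokens = tokenize("@user #CNN calls #Michigan middle school")
print(tokens)
# ['@user', '#cnn', 'calls', '#michigan', 'middle', 'school']
```

Note that `#cnn` and `cnn` would be treated as two different words by the later stages.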

---

```python
remover = StopWordsRemover(inputCol="words", outputCol="filtered_words")
```
- Removes common stop words that carry little signal for classification (the default list is English).
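
Stop-word removal is a simple filter over the token list. The sketch below uses a short, made-up stop-word set (`STOP_WORDS`) purely for illustration; Spark ships its own, much longer default list.

```python
# Hypothetical short list for illustration only.
STOP_WORDS = {"the", "is", "and", "a", "of", "to"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Keep only the tokens that are not in the stop-word set.
    return [t for t in tokens if t not in stop_words]

filtered = remove_stop_words(["the", "wall", "is", "a", "chant"])
print(filtered)
# ['wall', 'chant']
```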

---

```python
hashing_tf = HashingTF(inputCol="filtered_words", outputCol="raw_features", numFeatures=10000)
```
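- Transforms filtered words into fixed-length (10,000-dimensional) term-frequency vectors using hashing.
- **Theory**: HashingTF is efficient because it needs no vocabulary-building pass over the data. The trade-off is that different words can collide in the same bucket, and an index cannot be mapped back to a word.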
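
The hashing trick itself fits in a few lines. This sketch swaps in MD5 from the standard library as the hash function so the result is deterministic; Spark uses a different hash, so the bucket indices here will not match Spark's. `bucket_index` and `hashing_tf` are hypothetical names.

```python
import hashlib

def bucket_index(term, num_features=10000):
    # Map a term to a bucket: hash it, then take the remainder.
    digest = hashlib.md5(term.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_features

def hashing_tf(tokens, num_features=10000):
    # Sparse term-frequency vector as {bucket index: count}.
    counts = {}
    for term in tokens:
        idx = bucket_index(term, num_features)
        counts[idx] = counts.get(idx, 0) + 1
    return counts

vector = hashing_tf(["wall", "chant", "wall"])
print(vector)
```

With a very small `num_features`, unrelated words share buckets, which is the collision risk mentioned above.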

---

```python
idf = IDF(inputCol="raw_features", outputCol="features")
```
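- Computes Inverse Document Frequency. `IDF` is an estimator: during fitting it counts how many training tweets contain each feature, then down-weights features that occur in many tweets.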
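
As a worked example, the smoothed formula given in Spark's documentation is `idf(t) = log((N + 1) / (df(t) + 1))`, where `N` is the number of documents and `df(t)` is the number of documents containing term `t`. The sketch below applies it to raw words rather than hashed indices; `idf_weights` is a hypothetical helper.

```python
import math

def idf_weights(documents):
    # documents: list of token lists. Returns {term: idf weight}.
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        for term in set(doc):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}

docs = [["build", "wall"], ["wall", "chant"], ["wall", "motivation"]]
weights = idf_weights(docs)
print(weights["wall"], weights["build"])
```

Here "wall" appears in every document and gets weight 0, while "build" appears once and gets a weight of about 0.69.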

---

```python
lr = LogisticRegression(featuresCol="features", labelCol="label")
```
- Creates the logistic regression model using the transformed features and actual labels.
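
At prediction time, logistic regression computes a weighted sum of the features, passes it through the sigmoid function, and compares the probability with a threshold. A minimal sketch with made-up weights (the real weights are learned during `fit`; `predict` and the 0.5 threshold are assumptions for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(features, weights, bias, threshold=0.5):
    # features and weights are sparse dicts {index: value}.
    z = bias + sum(value * weights.get(i, 0.0) for i, value in features.items())
    probability = sigmoid(z)
    label = 1.0 if probability > threshold else 0.0
    return label, probability

# z = -1.0 + 1.2 * 2.0 + 0.5 * (-1.0) = 0.9
label, prob = predict({3: 1.2, 7: 0.5}, {3: 2.0, 7: -1.0}, bias=-1.0)
print(label, round(prob, 3))
```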

---

```python
pipeline = Pipeline(stages=[tokenizer, remover, hashing_tf, idf, lr])
```
- Assembles all the preprocessing and model steps into a pipeline.
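
Conceptually, a pipeline fits each stage on the output of the stages before it, then replays the same stages on new data. The toy classes below are hypothetical and only mimic that contract; they are not Spark classes.

```python
class Tokenize:
    # Transformer: stateless, so it only needs transform().
    def transform(self, docs):
        return [d.lower().split() for d in docs]

class KeepFrequentTerms:
    # Estimator: fit() learns state (terms seen at least twice).
    def fit(self, docs):
        counts = {}
        for doc in docs:
            for term in doc:
                counts[term] = counts.get(term, 0) + 1
        self.keep = {t for t, c in counts.items() if c >= 2}
        return self

    def transform(self, docs):
        return [[t for t in doc if t in self.keep] for doc in docs]

def fit_pipeline(stages, docs):
    # Fit stages in order; each one sees the previous stage's output.
    for stage in stages:
        if hasattr(stage, "fit"):
            stage.fit(docs)
        docs = stage.transform(docs)
    return stages

def transform_pipeline(stages, docs):
    for stage in stages:
        docs = stage.transform(docs)
    return docs

stages = fit_pipeline([Tokenize(), KeepFrequentTerms()],
                      ["Build the wall", "the wall chant"])
result = transform_pipeline(stages, ["The WALL motivation"])
print(result)
# [['the', 'wall']]
```

The state learned from the training documents is reused unchanged on the new document, which is why test data never influences the fitted stages.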

---

```python
model = pipeline.fit(train_df)
```
- Fits (trains) the entire pipeline on the training data.

---

```python
predictions = model.transform(test_df)
```
- Applies the trained model on the test data to generate predictions.
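
The notebook cell scores these predictions with accuracy. If the labels are imbalanced (not verified for `twitter.csv`), accuracy alone can look good while the hate class is missed entirely, so precision and recall are worth checking too. A sketch with made-up data; `binary_metrics` is a hypothetical helper operating on `(label, prediction)` pairs:

```python
def binary_metrics(pairs):
    # pairs: list of (label, prediction) with values 0 or 1.
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Made-up example: 90 non-hate and 10 hate tweets, model always predicts 0.
pairs = [(0, 0)] * 90 + [(1, 0)] * 10
accuracy, precision, recall = binary_metrics(pairs)
print(accuracy, precision, recall)
# 0.9 0.0 0.0
```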

---

```python
end_time = time.time()
execution_time = end_time - start_time
print(f"Execution Time: {execution_time:.2f} seconds")
```
- Stops the timer and prints the elapsed time. Because the timer started before the CSV was read, the figure covers loading, training, and prediction together.
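
To see where the time goes, each step can be timed separately. The `timed` helper below is hypothetical and is shown on a plain Python function; with Spark, keep in mind that transformations are lazy, so a step only does its work when an action such as `count()` or `collect()` runs.

```python
import time

def timed(label, func, *args):
    # Run func(*args) and report how long it took.
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.4f} seconds")
    return result, elapsed

def sum_of_squares(n):
    return sum(i * i for i in range(n))

total, elapsed = timed("sum of squares", sum_of_squares, 100000)
```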

---

```python
predictions.select("tweet", "prediction").show(10)
```
- Displays the first 10 rows with the tweet text and the predicted class (`0.0` for non-hate, `1.0` for hate).

---

```python
spark.stop()
```
- Properly shuts down the Spark session and releases resources.

---

### 📚 THEORETICAL CONCEPTS INVOLVED

- **Natural Language Processing (NLP)**: Processing human language data using computers.
- **TF-IDF**: Weights a word by how often it occurs in a tweet (TF), scaled down by how many tweets contain it (IDF).
- **Logistic Regression**: Binary classifier used here to detect hate speech.
- **Spark ML Pipeline**: Streamlines data preparation and model training.
- **Train-Test Split**: Used to evaluate how well the model generalizes.

---

### ❓ Possible Viva Questions & Answers

| **Question** | **Answer** |
|--------------|------------|
| What is the goal of your project? | To classify tweets as hate or non-hate based on their text using PySpark ML. |
| Why did you use logistic regression? | It's efficient and works well for binary classification tasks. |
| What is the purpose of Tokenizer? | To break tweets into individual words (tokens) for processing. |
| What does StopWordsRemover do? | Removes common words that do not add value to classification. |
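| Why use HashingTF instead of CountVectorizer? | HashingTF is faster and more memory-efficient because it doesn't store a vocabulary; the cost is possible hash collisions and no way to map an index back to a word. |
| What does IDF do? | It down-weights words that appear in many tweets and gives more weight to words that appear in few. |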
| What is a pipeline in PySpark? | A workflow that chains together multiple data transformation and model training steps. |
| What is the silhouette score? | (For clustering only) It measures how well samples are clustered. Not used here. |
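| What would happen if we didn’t split data? | We would evaluate on the same data used for training, so the score would be optimistic and overfitting would go undetected. |
| What if a tweet contains a hashtag or emoji? | Tokenizer only splits on whitespace, so `#cnn` or an emoji stays as its own token; extra cleaning (e.g., stripping `#` and special characters) may help. |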
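
One possible cleaning step for the last question, sketched in plain Python. `normalize_tweet` is a hypothetical helper and the rules (drop mentions, keep hashtag words, strip non-alphanumeric characters) are just one reasonable choice, not part of the pipeline above.

```python
def normalize_tweet(text):
    cleaned = []
    for token in text.lower().split():
        if token.startswith("@"):
            continue  # drop user mentions
        token = token.lstrip("#")  # keep the hashtag word itself
        # Strip quotes, punctuation, and emoji; keep letters and digits.
        token = "".join(ch for ch in token if ch.isalnum())
        if token:
            cleaned.append(token)
    return cleaned

words = normalize_tweet("@user #cnn calls #michigan 'build the wall' chant")
print(words)
# ['cnn', 'calls', 'michigan', 'build', 'the', 'wall', 'chant']
```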

---

'''