# m.5.Assignment -> Spooky authorship identification via Apache Spark


| **Overview** | Use Apache Spark and machine learning to determine sentence authorship labels. |
|----------|--------------------------------------------------------------------------------|
| **Data** | [Dark, ominous, and introspective](https://www.kaggle.com/competitions/spooky-author-identification/code) |


## **DETAILS**
#### **Dataset Description:** 
The Spooky Author Identification dataset contains text from public-domain works of fiction by three authors: Edgar Allan Poe, HP Lovecraft, and Mary Shelley. The data was prepared by chunking larger texts into sentences with CoreNLP's MaxEnt sentence tokenizer, so an odd non-sentence appears here and there. Your objective is to accurately identify the author of each sentence in the test set.
- **id** - unique identifier for each sentence
- **text** - sentence written by one of the authors
- **author** - author label: EAP (Edgar Allan Poe), HPL (HP Lovecraft), or MWS (Mary Wollstonecraft Shelley)
#### **Objective:**
- A. Accurately identify the author of the sentences in the test set.
- B. Perform ALL work using Apache Spark.
#### **Dataset:**
- The training set consists of sentences, each with an author label.
- The test set consists of sentences with no author labels.
#### **Competition Evaluation:**
Submissions were evaluated on multi-class logarithmic loss, which scores the predicted class probabilities rather than hard labels and heavily penalizes confident incorrect predictions. Lower log loss indicates better performance.
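To make the metric concrete, here is a minimal pure-Python sketch of multi-class log loss: the mean negative log of the probability assigned to the true author. The function name `multiclass_log_loss`, the dictionary layout of the predictions, and the clipping constant `eps` are assumptions made for illustration, not taken from the competition code.

```python
import math

def multiclass_log_loss(true_labels, predicted_probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    true_labels: list of author codes, e.g. ["EAP", "HPL"].
    predicted_probs: list of dicts mapping author code -> probability.
    eps: clipping bound (an assumption) that keeps log() finite.
    """
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        # Clip the probability of the true class away from 0 and 1
        p = min(max(probs[label], eps), 1.0 - eps)
        total += -math.log(p)
    return total / len(true_labels)

labels = ["EAP", "HPL"]
confident_right = [
    {"EAP": 0.90, "HPL": 0.05, "MWS": 0.05},
    {"EAP": 0.05, "HPL": 0.90, "MWS": 0.05},
]
confident_wrong = [
    {"EAP": 0.05, "HPL": 0.90, "MWS": 0.05},
    {"EAP": 0.90, "HPL": 0.05, "MWS": 0.05},
]

print("confident and right:", multiclass_log_loss(labels, confident_right))
print("confident and wrong:", multiclass_log_loss(labels, confident_wrong))
```

The second score is much larger than the first, which is the penalty for confident incorrect predictions described above.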
#### **Approach:**
Combine NLP techniques with machine learning algorithms. For feature engineering, consider bag-of-words, TF-IDF, and word embeddings such as Word2Vec. For modeling, use logistic regression, support vector machines, neural networks, or other algorithms as appropriate.

## **TASKS**


### **Stage 0: Import Data**

1. Create a code notebook called: `code_6_of_10_data_mine_<your_name>.ipynb`
2. Load data into Spark data objects and explore structure, size, and distribution of information.


```python
# Stage 0: load the data into Spark DataFrames and take a first look at their structure.
from pyspark.sql import SparkSession

# Create a local SparkSession that uses all available cores
spark = SparkSession.builder.appName("SpookyAuthor").master("local[*]").getOrCreate()

# Load train.csv and test.csv into Spark DataFrames
train_df = spark.read.csv("data/train.csv", header=True, inferSchema=True)
test_df = spark.read.csv("data/test.csv", header=True, inferSchema=True)

# Show the first few rows of each DataFrame
print("Train DataFrame:\n")
train_df.show(3)

print("\nTest DataFrame:\n")
test_df.show(3)
```


```text
Train DataFrame:

+-------+--------------------+------+
|     id|                text|author|
+-------+--------------------+------+
|id26305|This process, how...|   EAP|
|id17569|It never once occ...|   HPL|
|id11008|In his left hand ...|   EAP|
+-------+--------------------+------+
only showing top 3 rows

Test DataFrame:

+-------+--------------------+
|     id|                text|
+-------+--------------------+
|id02310|Still, as I urged...|
|id24541|If a fire wanted ...|
|id00134|And when they had...|
+-------+--------------------+
only showing top 3 rows
```
