[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)][collabURL]

# Module 0 - Homework 1 - Twitter Sentiment Analysis

**Author** : Victor Calderon


## Problem Statement

The airline industry had a hard time sustaining its business post-COVID
after a long halt in operations, so it is very important for airlines to
exceed customer expectations. Customer feedback is one of the most direct
ways to evaluate performance. You are given a dataset of *airline tweets*
from real customers.

The dataset comes from a sentiment analysis job about the problems of each
major U.S. airline. Twitter data was scraped from *February of 2015*, and
contributors were asked to first classify tweets as *positive*, *negative*,
or *neutral*, and then to categorize the *negative reasons* (e.g. "late
flight" or "rude service").

You will use the text column and the sentiment column to create a
classification model that assigns a given tweet to one of the 3 classes:

- Positive
- Negative
- Neutral

## Understanding the dataset

The dataset contains many columns; the most important ones are:

1. `airline_sentiment` : Sentiment label of the tweet (*positive*, *negative*, or *neutral*).
2. `negativereason` : Reason for the negative feedback (only set for negative tweets).
3. `text` : Text content of the tweet.
4. `tweet_location` : Location from which the tweet was posted.
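
A quick first look at these columns is to keep only the text and the label and count how many tweets fall into each class. The sketch below does this with Python's `csv` module on a few made-up rows (they are not taken from the dataset) and assumes the columns are named `airline_sentiment` and `text`; adjust the names if your copy differs. The helper `load_text_and_label` is hypothetical and only used for illustration.

```python
import csv
import io
from collections import Counter

# Made-up rows for illustration only; they are NOT from the Kaggle dataset.
SAMPLE_CSV = """airline_sentiment,text,tweet_location
negative,"@airline my flight was delayed for three hours",Boston
positive,"@airline thanks for the smooth trip!",
neutral,"@airline what time does check-in open?",Austin
negative,"@airline lost my bag again",
"""


def load_text_and_label(csv_text: str) -> list[tuple[str, str]]:
    """Keep only the (text, sentiment) pair from each row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["text"], row["airline_sentiment"]) for row in reader]


pairs = load_text_and_label(SAMPLE_CSV)

# Counting how many tweets fall into each sentiment class
label_counts = Counter(label for _, label in pairs)
print(label_counts)
```

With the real data, the same check can be done on the loaded DataFrame; a strongly unbalanced count is worth keeping in mind when evaluating the model later.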

## Steps to perform

1. Load dataset: [link][datasetURL]
2. Clean and preprocess the data, and perform exploratory data analysis (EDA).
3. Vectorise columns that contain text.
4. Run classification model to classify - *positive*, *negative*, or *neutral*.
5. Evaluate model.
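
To make steps 2 and 3 more concrete, here is a minimal, standard-library-only sketch of cleaning tweets and turning them into count vectors. The regular expressions and the helper names (`clean_tweet`, `build_vocabulary`, `vectorise`) are assumptions for illustration, and the two example tweets are made up. In practice a library vectoriser such as scikit-learn's `CountVectorizer` or `TfidfVectorizer` would likely be used instead.

```python
import re
from collections import Counter

# Illustrative patterns; real preprocessing may need more rules.
URL_PATTERN = re.compile(r"https?://\S+")
MENTION_PATTERN = re.compile(r"@\w+")
TOKEN_PATTERN = re.compile(r"[a-z']+")


def clean_tweet(text: str) -> str:
    """Lowercase the tweet and drop URLs and @mentions."""
    text = text.lower()
    text = URL_PATTERN.sub(" ", text)
    text = MENTION_PATTERN.sub(" ", text)
    return " ".join(text.split())


def build_vocabulary(texts: list[str]) -> dict[str, int]:
    """Map each distinct token to a column index (sorted for stability)."""
    tokens = sorted({tok for t in texts for tok in TOKEN_PATTERN.findall(t)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def vectorise(text: str, vocabulary: dict[str, int]) -> list[int]:
    """Turn a cleaned tweet into a vector of token counts."""
    counts = Counter(TOKEN_PATTERN.findall(text))
    return [counts.get(tok, 0) for tok in vocabulary]


# Made-up tweets, not taken from the dataset
tweets = [
    "@airline Flight delayed AGAIN http://t.co/abc",
    "Great flight, great crew @airline",
]
cleaned = [clean_tweet(t) for t in tweets]
vocabulary = build_vocabulary(cleaned)
vectors = [vectorise(t, vocabulary) for t in cleaned]
print(list(vocabulary))
print(vectors)
```

Each vector has one entry per vocabulary token, so every tweet ends up with the same length regardless of how many words it contains, which is what a classifier needs as input.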

<!-- Links -->

[jupyterNotebook]: https://github.com/hamzafarooq/maven-mlsystem-design-cohort-1/blob/main/Module-0/airline_tweet_sentiment.ipynb

[datasetURL]: https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment

[spacyURL]: https://spacy.io/

[kaggleURL]: https://www.kaggle.com/

[nltkURL]: https://www.nltk.org/

[collabURL]: https://colab.research.google.com/github/vcalderon2009/ML-System-Design-Course/blob/feature%2Fmodule-0-intro-to-nlp/modules/Module_0/Homework/01-Victor-Calderon-Module-0-Homework-1-Twitter-Sentiment-Analysis.ipynb

---

## 0.1 Loading modules

In [None]:
# Installing Kaggle, if necessary
!pip install -q kaggle

In [None]:
from pathlib import Path
import shutil
import os
import tempfile
import logging
from zipfile import ZipFile
from typing import Optional
import pandas as pd

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

## 0.2 Defining utility functions

Before we begin, we'll need to define some utility functions in order to
not clutter the notebook.

In [None]:
def create_temp_directory():
    """
    Function to create a temporary directory.
    
    Returns
    ---------
    temp_dir : pathlib.Path
        Path to the temporary directory
    """
    # `mkdtemp` creates the directory itself, so no extra `mkdir` is needed
    temp_dir = tempfile.mkdtemp()

    return Path(temp_dir).resolve()

In [None]:
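import subprocess


def run_system_command(cmd: str):
    """
    Function to run a system command, raising an error if it fails.
    """
    # `check=True` surfaces failures instead of silently ignoring them
    subprocess.run(cmd, shell=True, check=True)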

In [None]:
def get_kaggle_credentials():
    """
    Function to get the API credentials for interacting with Kaggle.
    """
    # Determining whether the notebook is running in Google Colab
    try:
        import google.colab  # noqa: F401

        COLAB_WORKSPACE = True
    except ImportError:
        COLAB_WORKSPACE = False

    if COLAB_WORKSPACE:
        # Asking the user to upload the `kaggle.json` credentials file
        from google.colab import files

        files.upload()
        # Creating directory to place the credentials (`~` must be expanded)
        kaggle_output_directory = Path("~/.kaggle").expanduser()
        kaggle_output_directory.mkdir(exist_ok=True, parents=True)
        # Copying credentials to kaggle directory
        kaggle_creds_output_filepath = kaggle_output_directory.joinpath(
            "kaggle.json"
        )
        shutil.copy("./kaggle.json", kaggle_creds_output_filepath)
        # Restricting the file to owner read/write (octal 600)
        kaggle_creds_output_filepath.chmod(0o600)


In [None]:
def get_kaggle_dataset_and_load(
    username: str,
    dataset_name: str,
    extension: Optional[str] = "csv",
    delete_file: Optional[bool] = False,
) -> pd.DataFrame:
    """
    Function to download a dataset from Kaggle and load it as a DataFrame.

    Parameters
    -----------
    username : str
        Username that owns the dataset.

    dataset_name : str
        Name of the dataset.

    extension : str, optional
        File extension of the uncompressed dataset.

    delete_file : bool, optional
        If ``True``, the uncompressed file will be deleted.
        This variable is set to ``False`` by default.

    Returns
    ----------
    dataset_df : pandas.DataFrame
        Dataset from Kaggle
    """
    # Creating temporary directory
    temp_dir = create_temp_directory()
    # Downloading the specified dataset into the temporary directory
    cmd = f'kaggle datasets download -p "{temp_dir}" {username}/{dataset_name}'
    logger.info(cmd)
    run_system_command(cmd=cmd)
    # Finding the downloaded `.zip` file in the temporary directory
    dataset_zip_filepath = list(temp_dir.glob("*.zip"))[0]
    # Decompressing file
    with ZipFile(str(dataset_zip_filepath), "r") as zfile:
        zfile.extractall(path=temp_dir)

    # Figuring out the path to the uncompressed file
    dataset_filepath = list(temp_dir.rglob(f"*.{extension}"))[0]
    logger.info(f"Uncompressed dataset: `{dataset_filepath}` downloaded!")

    # Reading in data
    read_func = {
        ".csv": pd.read_csv,
        ".parquet": pd.read_parquet,
        ".json": pd.read_json,
    }

    file_extension = dataset_filepath.suffix

    dataset_df = read_func[file_extension](dataset_filepath)

    # Deleting the downloaded `.zip` file
    logger.info(f"dataset_zip_filepath: {dataset_zip_filepath}")
    dataset_zip_filepath.unlink()

    if delete_file:
        dataset_filepath.unlink()

    return dataset_df


---

## 1. Downloading and loading the Kaggle dataset

The first step is to download the dataset from Kaggle. To do this directly from the notebook,
we need Kaggle API credentials (the `kaggle.json` file).
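
Before calling the Kaggle CLI, it can help to confirm that credentials are actually in place. The sketch below assumes the usual Kaggle conventions: either the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables, or a `~/.kaggle/kaggle.json` file with `username` and `key` fields. The helper `find_kaggle_credentials` is hypothetical (it is not part of the Kaggle package), and the demo runs against a throwaway directory with placeholder values.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Mapping, Optional


def find_kaggle_credentials(
    home: Optional[Path] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """
    Report where Kaggle credentials were found: "env", "file", or None.
    `home` and `environ` are injectable so the check can be tried safely.
    """
    home = Path.home() if home is None else home
    environ = os.environ if environ is None else environ

    # Environment variables are checked first
    if environ.get("KAGGLE_USERNAME") and environ.get("KAGGLE_KEY"):
        return "env"

    # Otherwise, look for a credentials file with both fields filled in
    creds_path = home / ".kaggle" / "kaggle.json"
    if creds_path.is_file():
        creds = json.loads(creds_path.read_text())
        if creds.get("username") and creds.get("key"):
            return "file"
    return None


# Trying the check against a throwaway "home" directory
with tempfile.TemporaryDirectory() as tmp:
    fake_home = Path(tmp)
    print(find_kaggle_credentials(home=fake_home, environ={}))

    kaggle_dir = fake_home / ".kaggle"
    kaggle_dir.mkdir()
    # Placeholder values, not real credentials
    (kaggle_dir / "kaggle.json").write_text(
        json.dumps({"username": "example-user", "key": "example-key"})
    )
    print(find_kaggle_credentials(home=fake_home, environ={}))
```

Calling `find_kaggle_credentials()` with no arguments checks the real home directory and environment, which gives a clearer message than a failed download if nothing is configured.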

In [None]:
# Getting Kaggle Credentials
get_kaggle_credentials()

# Defining the username and name of the dataset
KAGGLE_DATASET_USERNAME = "crowdflower"
KAGGLE_DATASET_NAME = "twitter-airline-sentiment"

dataset_df = get_kaggle_dataset_and_load(
    username=KAGGLE_DATASET_USERNAME,
    dataset_name=KAGGLE_DATASET_NAME,
)


In [None]:
dataset_df.head()