# Sentiment Analysis on notable speeches of the last decade

## Introduction

This notebook demonstrates how to build a simple `Long Short-Term Memory (LSTM)` network from scratch in NumPy to perform sentiment analysis on a socially relevant and ethically acquired dataset.

This deep learning model (the LSTM) is a type of `Recurrent Neural Network` that will learn to classify a piece of text as positive or negative from the IMDb reviews dataset. The dataset contains 50,000 movie reviews and their corresponding labels. Because the labels are provided, this is `supervised learning`: the network is trained on numeric representations of the reviews using `forward propagation` and `backpropagation through time`, the variant of backpropagation used for sequential data. The output will be a vector containing the probability that each text sample is positive.
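To make the forward pass concrete, here is a minimal sketch of a single LSTM time step in NumPy, followed by a sigmoid output layer that turns the final hidden state into a probability. The function name `lstm_step`, the stacked weight layout, and all dimensions are assumptions chosen for illustration, and the weights are random rather than trained, so this is not necessarily how the rest of the tutorial structures its model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step (illustrative layout, not the tutorial's own).

    W maps the stacked vector [h_prev; x_t] to the pre-activations of the
    four gates, stored in the order: forget, input, candidate, output.
    """
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b   # shape: (4 * hidden, batch)
    f = sigmoid(z[:hidden])                     # forget gate
    i = sigmoid(z[hidden:2 * hidden])           # input gate
    g = np.tanh(z[2 * hidden:3 * hidden])       # candidate cell state
    o = sigmoid(z[3 * hidden:])                 # output gate
    c_t = f * c_prev + i * g                    # updated cell state
    h_t = o * np.tanh(c_t)                      # updated hidden state
    return h_t, c_t

# Assumed toy dimensions; real inputs would be word embeddings of the reviews.
rng = np.random.default_rng(0)
input_dim, hidden_dim, batch, steps = 8, 4, 3, 5

W = rng.normal(scale=0.1, size=(4 * hidden_dim, hidden_dim + input_dim))
b = np.zeros((4 * hidden_dim, 1))
W_out = rng.normal(scale=0.1, size=(1, hidden_dim))
b_out = np.zeros((1, 1))

x = rng.normal(size=(steps, input_dim, batch))  # random stand-in for a sequence
h = np.zeros((hidden_dim, batch))
c = np.zeros((hidden_dim, batch))

# Forward propagation: walk through the sequence one time step at a time.
for t in range(steps):
    h, c = lstm_step(x[t], h, c, W, b)

# One probability per sample that the text is positive.
probs = sigmoid(W_out @ h + b_out)
print(probs.shape)
```

Backpropagation through time would then push the loss gradient backwards through this same loop, from the last time step to the first.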

## Requirements

* `Python`
* Array manipulation using `NumPy`
* Basic understanding of `linear algebra` and `calculus`
* Understanding how `Neural Networks` work
* `pandas` for handling dataframes
* `Matplotlib` for data visualization
* `pooch` to download and cache datasets

## 1. Data Collection

Before choosing the data to train the model on, keep the following points in mind:

* __Identifying Data Bias__: Bias is an inherent component of the human thought process. Therefore data sourced from human activities reflects that bias. Some ways in which this bias tends to occur in Machine Learning datasets are:
    * _Bias in historical data_: Historical data are often skewed towards, or against, particular groups. Data can also be severely imbalanced with limited information on protected groups.
    * _Bias in data collection mechanisms_: Lack of representativeness introduces inherent biases in the data collection process.
    * _Bias towards observable outcomes_: In some scenarios, the true outcomes are known only for a certain section of the population. Without information on all outcomes, fairness cannot even be measured.

* __Preserving human anonymity for sensitive data__: `Trevisan` and `Reilly` identified a list of sensitive topics that need to be handled with extra care. That list is presented below, along with a few additions:
    * personal daily routines (including location data)
    * individual details about impairment and/or medical records
    * emotional accounts of pain and chronic illness
    * financial information about income and/or welfare payments
    * discrimination and abuse episodes
    * criticism/praise of individual providers of healthcare and support services
    * suicidal thoughts
    * criticism/praise of a power structure especially if it compromises their safety
    * personally-identifying information (even if anonymized in some way) including things like fingerprints or voice

> While obtaining consent from so many people can be difficult, especially on online platforms, whether it is necessary depends on the sensitivity of the topics your data includes and on other indicators, such as whether the platform the data was obtained from allows users to operate under pseudonyms. If the website has a policy that forces the use of a real name, then the users need to be asked for consent.

In this section, two different datasets will be collected: the IMDb movie reviews dataset, and a collection of 10 speeches curated for this tutorial, delivered by activists from different countries, at different times, and on different topics. The former will be used to train the deep learning model, while the latter will be the text on which sentiment analysis is performed.
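One of the data bias concerns listed earlier, class imbalance, is easy to check once a labeled dataset has been loaded. The sketch below uses a tiny hand-made dataframe as a stand-in for the training data. The column names `review` and `sentiment` and the 10-percentage-point threshold are assumptions made for illustration, not properties of the actual dataset.

```python
import pandas as pd

# Hypothetical stand-in for a labeled review dataset; the column names
# "review" and "sentiment" are assumptions made for this sketch.
reviews = pd.DataFrame(
    {
        "review": ["Loved it", "Terrible plot", "Great acting", "Not my taste"],
        "sentiment": ["positive", "negative", "positive", "negative"],
    }
)

# Share of each label in the data; a heavily skewed split would suggest
# the model could learn the majority class rather than the sentiment.
label_share = reviews["sentiment"].value_counts(normalize=True)
print(label_share)

# Arbitrary illustrative threshold: flag the data as balanced when the
# label shares differ by less than 10 percentage points.
is_balanced = (label_share.max() - label_share.min()) < 0.1
print("balanced:", is_balanced)
```

A check like this only covers label balance. Bias in how the data was collected, or in whose outcomes are observable at all, is not visible in the label counts.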