## Stage 1: Exploratory Data Analysis

Steps / Procedure:
- Look at the structure of the data: number of data points, number of features, feature names, data types, etc.
- When dealing with multiple data sources, check for consistency across datasets.
- Identify what the data signify for each data point (the measures), and keep that meaning in mind when computing metrics.
- Calculate key metrics for each variable (summary analysis): a. Measures of central tendency (Mean, Median, Mode); b. Measures of dispersion (Range, Quartile Deviation, Mean Deviation, Standard Deviation); c. Measures of skewness and kurtosis.
- Investigate visuals: a. Histogram for each variable; b. Scatterplot to correlate variables.
- Calculate metrics and visuals per category for categorical variables (nominal, ordinal).
- Identify outliers and mark them. Based on context, either discard outliers or analyze them separately.
- Estimate missing points using data imputation techniques.
- Estimate data quality.
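
The first and last steps above (inspecting structure, then handling missing points) can be sketched with the standard library alone. This is a minimal illustration, not a prescribed method: the records, the field names (`company`, `sector`, `report_pages`) and the choice of median imputation are all assumptions made for the example.

```python
from statistics import median

# Hypothetical records; None marks a missing value.
rows = [
    {"company": "A", "sector": "energy", "report_pages": 120},
    {"company": "B", "sector": "retail", "report_pages": None},
    {"company": "C", "sector": None, "report_pages": 80},
    {"company": "D", "sector": "energy", "report_pages": 100},
]

# Structure: number of data points, feature names, observed types.
n_rows = len(rows)
features = list(rows[0].keys())
dtypes = {
    f: sorted({type(r[f]).__name__ for r in rows if r[f] is not None})
    for f in features
}

# Missing values per feature.
missing = {f: sum(r[f] is None for r in rows) for f in features}

# Median imputation for one numeric feature (an assumed technique).
# A flag column keeps imputed values distinguishable from observed ones.
observed = [r["report_pages"] for r in rows if r["report_pages"] is not None]
fill = median(observed)
for r in rows:
    r["report_pages_imputed"] = r["report_pages"] is None
    if r["report_pages"] is None:
        r["report_pages"] = fill

print(n_rows, features)
print(dtypes)
print(missing)
```

Keeping the `report_pages_imputed` flag makes it possible to rerun later summaries with and without the filled-in values.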


Measures of central tendency 

A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data. These include the following:
- Mean: Mean is equal to the sum of all the values in the data set divided by the number of values in the data set. This is also called arithmetic mean. Other means such as geometric mean and harmonic mean are also sometimes useful.
- Median: Median is the middle score for a set of data that has been arranged in order of magnitude. For example, given an ordered list of student marks, [14 35 45 55 55 56 58 65 87 89 92], the median is 56 because there are 5 items before it and 5 items after it. With an even number of values, the median is the average of the two middle values.
- Mode: Mode is the most frequent score in the data set. For the above data set of student marks, the mode is 55 because it occurs more often than any other value.
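
As a quick check of the definitions above, the student marks can be run through Python's built-in `statistics` module. This is only a sketch; the variable name `marks` is chosen for the example.

```python
from statistics import geometric_mean, harmonic_mean, mean, median, mode

marks = [14, 35, 45, 55, 55, 56, 58, 65, 87, 89, 92]

print(round(mean(marks), 2))  # 59.18 (651 / 11)
print(median(marks))          # 56
print(mode(marks))            # 55

# For positive data: arithmetic mean >= geometric mean >= harmonic mean.
print(geometric_mean(marks), harmonic_mean(marks))
```

The mean (59.18) sits above the median (56) here, a first hint that the larger marks pull the distribution to the right.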

Skewness & Kurtosis

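Skewness is a measure of the asymmetry of a distribution about its mean: positive skew indicates a longer right tail, negative skew a longer left tail.

Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution.
That is, data sets with high kurtosis tend to have heavy tails, or outliers. Data sets with low kurtosis tend to have light tails, or lack of outliers.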
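
Both quantities can be computed from central moments. The sketch below uses the simple population (biased) moment formulas and reports excess kurtosis, so a normal distribution scores about 0. The helper names and the sample lists are assumptions for illustration, and statistical libraries may apply bias corrections that give slightly different numbers.

```python
from statistics import fmean


def central_moment(xs, k):
    """k-th central moment using population (biased) averaging."""
    mu = fmean(xs)
    return fmean([(x - mu) ** k for x in xs])


def skewness(xs):
    """Third standardized moment: 0 for symmetric data."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5


def excess_kurtosis(xs):
    """Fourth standardized moment minus 3 (the normal-distribution value)."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2 - 3.0


symmetric = [1, 2, 3, 4, 5]        # hypothetical, symmetric about 3
right_tailed = [1, 1, 1, 2, 10]    # hypothetical, one large value

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_tailed), excess_kurtosis(right_tailed))
```

The symmetric list has zero skewness and negative excess kurtosis (light tails), while the single large value in the second list produces positive skewness.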


Measures of Dispersion

- Range is the difference between the smallest value and the largest value in the data set. This is the simplest measure but it's based on extreme values and tells nothing about the data in between.
- Standard Deviation is a better measure because it uses every value rather than only the extremes. For approximately normally distributed data, a value within ±1 SD of the mean is considered normal; a value beyond ±3 SD is considered extremely abnormal. A simpler alternative is the Mean Absolute Deviation (MAD). Another alternative, often used as a measurement of error, is Root Mean Square Anomaly (RMSA).
- If the spread around the central region of the data is of interest, Quartile Deviation is a good measure. It is half of the Interquartile Range (IQR). A related measure that considers all data is the Median Absolute Deviation. It is also abbreviated MAD, so state which of the two is meant.
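
The measures above can be compared on the student marks from the previous section. This is a sketch with the standard library; it assumes the population standard deviation (`pstdev`) and the default "exclusive" quartile method of `statistics.quantiles`, and other quartile conventions give slightly different values.

```python
from statistics import fmean, median, pstdev, quantiles

marks = [14, 35, 45, 55, 55, 56, 58, 65, 87, 89, 92]

value_range = max(marks) - min(marks)

std_dev = pstdev(marks)  # population SD; use stdev() for the sample SD

mu = fmean(marks)
mean_abs_dev = fmean([abs(x - mu) for x in marks])

q1, _, q3 = quantiles(marks, n=4)  # default method is "exclusive"
quartile_dev = (q3 - q1) / 2       # half the interquartile range

med = median(marks)
median_abs_dev = median([abs(x - med) for x in marks])

print(value_range)      # 78
print(quartile_dev)     # 21.0
print(median_abs_dev)   # 11
print(std_dev, mean_abs_dev)
```

The median absolute deviation (11) is much smaller than the quartile deviation (21.0) here because most marks cluster tightly around the median.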

Outliers
Any observation that appears to deviate markedly from other observations in the sample is considered an outlier. Identifying an observation as an outlier depends on the underlying distribution of the data. Determining whether an observation is an outlier or not is a subjective exercise.
- Context dictates whether to focus on or get rid of outliers. For example, in an income distribution, a luxury brand company would focus on the outliers (the rich people) while a government public distribution system would choose to get rid of the outliers. It's recommended that you generate a normal probability plot of the data before applying an outlier test.
- Outliers can also come in different flavours, depending on the environment: point outliers, contextual outliers, or collective outliers.
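
One common convention for marking point outliers is the interquartile-range rule (Tukey's fences). It is not named in the notes above and is used here only as an assumed example. The function `mark_outliers`, the multiplier `k=1.5` and the income figures are all hypothetical. Values are flagged rather than dropped, so they can be discarded or analysed separately depending on context.

```python
from statistics import quantiles


def mark_outliers(values, k=1.5):
    """Pair each value with True if it lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [(v, v < low or v > high) for v in values]


# Hypothetical incomes (in thousands) with one extreme value.
incomes = [30, 32, 35, 36, 38, 40, 41, 43, 45, 400]

marked = mark_outliers(incomes)
outliers = [v for v, is_outlier in marked if is_outlier]
inliers = [v for v, is_outlier in marked if not is_outlier]

print(outliers)  # [400]
```

This rule only targets point outliers. Contextual and collective outliers need methods that take the surrounding observations into account.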



Analysis:
- Average length by datatype
- Word segmentation and word frequencies
- Number of documents by company
- Using TF-IDF to find the most characteristic words by company
- Time series of ESG topic distributions to analyse patterns over time
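
The TF-IDF item above can be sketched without external libraries by treating each company's text as one document. The company names, texts and the helper `top_terms` are hypothetical, and the scoring is the plain `tf * ln(N / df)` form. Library implementations often use smoothed variants, so their scores will differ.

```python
import math
import re
from collections import Counter

# Hypothetical corpus: one concatenated text per company.
docs = {
    "AcmeEnergy": "emissions targets and emissions reporting for our turbines",
    "BetaRetail": "supplier audits and packaging waste in our stores",
    "GammaBank": "governance policy and lending policy for our clients",
}


def tokenize(text):
    """Lowercase and keep alphabetic runs only (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


# Term counts per company and document frequency per term.
tf = {name: Counter(tokenize(text)) for name, text in docs.items()}
df = Counter(term for counts in tf.values() for term in counts)
n_docs = len(docs)


def top_terms(name, k=3):
    """Return the k terms with the highest TF-IDF score for one company."""
    counts = tf[name]
    total = sum(counts.values())
    scores = {
        term: (count / total) * math.log(n_docs / df[term])
        for term, count in counts.items()
    }
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


for company in docs:
    print(company, top_terms(company))
```

Words that appear in every company's text, such as "our" and "and", get a score of zero, which is what lets the company-specific vocabulary rise to the top.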

Text Specific:

- Frequency: token, type etc. frequency analysis
- Length: token, sentence length analysis
- N-gram explorations
- Unknown words
- Style and domain specificity
- Error analysis, e.g., spelling errors
- Semantic relations (synonymy, hypernymy)
- Linguistic units: POS, NER
- Topic exploration
- Keyword analysis
- Comprehension/text complexity metrics
- Similarity metrics
- Sentiment
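
A few of the items above (token and type frequency, sentence length, n-grams) can be explored with regular expressions and `collections.Counter`. This is a rough sketch: the sample text is made up, and the regex tokenizer and sentence splitter are simplifying assumptions that a proper NLP tokenizer would replace.

```python
import re
from collections import Counter

# Hypothetical sample text.
text = "We cut emissions. We cut waste and we report emissions every year."

# Naive sentence split on whitespace that follows ., ! or ?
sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

# Naive word tokens: lowercase alphabetic runs (apostrophes allowed).
tokens = re.findall(r"[a-z']+", text.lower())

# Token frequency and type/token ratio (distinct words / total words).
token_freq = Counter(tokens)
type_token_ratio = len(token_freq) / len(tokens)

# Bigram counts; note that these pairs run across sentence boundaries.
bigrams = Counter(zip(tokens, tokens[1:]))

# Sentence length measured in tokens.
sentence_lengths = [len(re.findall(r"[a-z']+", s.lower())) for s in sentences]

print(token_freq.most_common(3))
print(round(type_token_ratio, 2))
print(bigrams.most_common(1))
print(sentence_lengths)
```

Passing `zip(tokens, tokens[1:], tokens[2:])` to `Counter` gives trigram counts in the same way.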


In [None]:
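from nltk.sentiment.vader import SentimentIntensityAnalyzer  # VADER; needs the "vader_lexicon" NLTK resource

from textblob import TextBlob  # TextBlob(text).sentiment.polarity lies in [-1.0, 1.0]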

In [None]:
# Report length

# By ESG topic

# Add sector by company