### LSE Data Analytics Online Career Accelerator 

# DA301:  Advanced Analytics for Organisational Impact

### Scenario
You are a data analyst working for Turtle Games, a game manufacturer and retailer. They manufacture and sell their own products, along with sourcing and selling products manufactured by other companies. Their product range includes books, board games, video games and toys. They have a global customer base and have a business objective of improving overall sales performance by utilising customer trends. In particular, Turtle Games wants to understand: 
- how customers accumulate loyalty points (Week 1)
- how useful the remuneration and spending score data are (Week 2)
- whether social data (e.g. customer reviews) can be used in marketing campaigns (Week 3)
- what is the impact on sales per product (Week 4)
- the reliability of the data (e.g. normal distribution, Skewness, Kurtosis) (Week 5)
- if there is any possible relationship(s) in sales between North America, Europe, and global sales (Week 6).

# Week 1 assignment: Linear regression using Python
The marketing department of Turtle Games prefers Python for data analysis. As you are fluent in Python, they asked you to assist with data analysis of social media data. The marketing department wants to better understand how users accumulate loyalty points. Therefore, you need to investigate the possible relationships between the loyalty points, age, remuneration, and spending scores. Note that you will use this data set in future modules as well and it is, therefore, strongly encouraged to first clean the data as per provided guidelines and then save a copy of the clean data for future use.

## Instructions
1. Load and explore the data.
    1. Create a new DataFrame (e.g. reviews).
    2. Sense-check the DataFrame.
    3. Determine if there are any missing values in the DataFrame.
    4. Create a summary of the descriptive statistics.
2. Remove redundant columns (`language` and `platform`).
3. Change column headings to names that are easier to reference (e.g. `remuneration` and `spending_score`).
4. Save a copy of the clean DataFrame as a CSV file. Import the file to sense-check.
5. Use linear regression and the `statsmodels` functions to evaluate possible linear relationships between loyalty points and age/remuneration/spending scores to determine whether these can be used to predict the loyalty points.
    1. Specify the independent and dependent variables.
    2. Create the OLS model.
    3. Extract the estimated parameters, standard errors, and predicted values.
    4. Generate the regression table based on the X coefficient and constant values.
    5. Plot the linear regression and add a regression line.
6. Include your insights and observations.

## 1. Load and explore the data

In [None]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm 
from statsmodels.formula.api import ols
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Load the CSV file(s) as reviews.
reviews = pd.read_csv('turtle_reviews.csv')

# View the DataFrame.
print(reviews)

In [None]:
# Any missing values?
reviews.isna().sum()

In [None]:
# Explore the data.
print(reviews.shape)
print(reviews.columns)
print(reviews.dtypes)
reviews.head()

In [None]:
reviews.info()

In [None]:
# Descriptive statistics.
reviews.describe()

## 2. Drop columns

In [None]:
# Drop unnecessary columns.
reviews.drop(['language', 'platform'], inplace=True, axis=1)
reviews.info()

In [None]:
# View column names.
column_names = list(reviews.columns.values)
print("Column names:", column_names)

## 3. Rename columns

In [None]:
# Rename the column headers
reviews.rename(columns={'remuneration (k£)': 'remuneration', 'spending_score (1-100)': 'spending_score'}, inplace=True)
column_names = list(reviews.columns.values)

# View column names.
print("Column names:", column_names)

## 4. Save the DataFrame as a CSV file

In [None]:
# Create a CSV file as output (in the notebook folder, without the index column).
reviews.to_csv('reviews_new.csv', index=False)

In [None]:
# Import new CSV file with Pandas.
reviews_new = pd.read_csv('reviews_new.csv')

# View the DataFrame.
print(reviews_new)

## 5. Linear regression

### 5a) Spending vs Loyalty

In [None]:
# Define the dependent variable.
y = reviews_new['loyalty_points'] 

# Define the independent variable.
x = reviews_new['spending_score'] 

# Check for linearity with Matplotlib.
plt.scatter(x, y)

# Insert labels.
plt.xlabel('Spending score', fontsize=10)
plt.ylabel('Loyalty points', fontsize=10)

In [None]:
# OLS model and summary.
f = 'y ~ x'
test = ols(f, data = reviews_new).fit()

# Print the regression table.
test.summary() 

In [None]:
# Extract the estimated parameters.
print("Parameters: ", test.params)  

# Extract the standard errors.
print("Standard errors: ", test.bse)  

# Extract the predicted values.
print("Predicted values: ", test.predict())

In [None]:
# Set the X coefficient and the constant to generate the regression table.
y_pred = ( -75.0526) +  33.0616 * reviews_new['spending_score'] 
         
# View the output.
y_pred
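
The constant and coefficient in the cell above are typed in by hand from the regression table. A minimal sketch of reading them from the fitted model instead, so that the prediction stays in sync with the fit. The synthetic `demo` DataFrame and its noise settings are assumptions for illustration only, not the Turtle Games data.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data (assumption): loyalty = 30 * score - 50 plus noise.
rng = np.random.default_rng(0)
demo = pd.DataFrame({'spending_score': rng.uniform(1, 100, 200)})
demo['loyalty_points'] = (30 * demo['spending_score'] - 50
                          + rng.normal(0, 5, 200))

# Fit the model by column name rather than through global x and y.
model = ols('loyalty_points ~ spending_score', data=demo).fit()

# Read the constant and the X coefficient from the fitted model.
intercept = model.params['Intercept']
slope = model.params['spending_score']

# Predictions built from the parameters agree with model.predict().
y_manual = intercept + slope * demo['spending_score']
print(round(intercept, 2), round(slope, 2))
print(np.allclose(y_manual, model.predict(demo)))
```

Referring to the columns by name in the formula also makes the labels in the regression table self-describing.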

In [None]:
# Plot the graph with a regression line.

# Plot the data points with a scatterplot.
plt.scatter(x, y)

# Plot the regression line (in black).
plt.plot(x, y_pred, color='black')

# Set the x and y limits on the axes.
plt.xlim(0)
plt.ylim(0)

# Insert labels.
plt.xlabel('Spending score', fontsize=10)
plt.ylabel('Loyalty points', fontsize=10)

# View the plot.
plt.show()

### 5b) Remuneration vs Loyalty

In [None]:
# Define the dependent variable.
y = reviews_new['loyalty_points'] 

# Define the independent variable.
x = reviews_new['remuneration'] 

# Check for linearity with Matplotlib.
plt.scatter(x, y)

# Insert labels and title.
plt.xlabel('Remuneration', fontsize=10)
plt.ylabel('Loyalty points', fontsize=10)

In [None]:
# OLS model and summary.
f = 'y ~ x'
test = ols(f, data = reviews_new).fit()

# Print the regression table.
test.summary() 

In [None]:
# Extract the estimated parameters.
print("Parameters: ", test.params)  

# Extract the standard errors.
print("Standard errors: ", test.bse)  

# Extract the predicted values.
print("Predicted values: ", test.predict())

In [None]:
# Set the X coefficient and the constant to generate the regression table.
y_pred = (-65.6865) + 34.1878 * reviews_new['remuneration'] 
         
# View the output.
y_pred

In [None]:
# Plot the graph with a regression line.

# Plot the data points with a scatterplot.
plt.scatter(x, y)

# Plot the regression line (in black).
plt.plot(x, y_pred, color='black')

# Set the x and y limits on the axes.
plt.xlim(0)
plt.ylim(0)

plt.xlabel('Remuneration', fontsize=10)
plt.ylabel('Loyalty points', fontsize=10)

# View the plot.
plt.show()

### 5c) Age vs Loyalty

In [None]:
# Define the dependent variable.
y = reviews_new['loyalty_points'] 

# Define the independent variable.
x = reviews_new['age'] 

plt.xlabel('Age', fontsize=10)
plt.ylabel('Loyalty points', fontsize=10)

# Check for linearity with Matplotlib.
plt.scatter(x, y)

In [None]:
# OLS model and summary.
f = 'y ~ x'
test = ols(f, data = reviews_new).fit()

# Print the regression table.
test.summary() 

In [None]:
# Extract the estimated parameters.
print("Parameters: ", test.params)  

# Extract the standard errors.
print("Standard errors: ", test.bse)  

# Extract the predicted values.
print("Predicted values: ", test.predict())

In [None]:
# Set the X coefficient and the constant (read from the fitted model) to generate the predicted values.
y_pred = test.params['Intercept'] + test.params['x'] * reviews_new['age']
         
# View the output.
y_pred

In [None]:
# Plot the graph with a regression line.

# Plot the data points with a scatterplot.
plt.scatter(x, y)

# Plot the regression line (in black).
plt.plot(x, y_pred, color='black')

# Set the x and y limits on the axes.
plt.xlim(0)
plt.ylim(0)


plt.xlabel('Age', fontsize=10)
plt.ylabel('Loyalty points', fontsize=10)

# View the plot.
plt.show()

## 6. Observations and insights

***Your observations here...***






# 

# Week 2 assignment: Clustering with *k*-means using Python

The marketing department also wants to better understand the usefulness of remuneration and spending scores but does not know where to begin. You are tasked with identifying groups within the customer base that can be used to target specific market segments. Use *k*-means clustering to identify the optimal number of clusters and then apply and plot the data using the created segments.

## Instructions
1. Prepare the data for clustering. 
    1. Import the CSV file you have prepared in Week 1.
    2. Create a new DataFrame (e.g. `df2`) containing the `remuneration` and `spending_score` columns.
    3. Explore the new DataFrame.
2. Plot the remuneration versus spending score.
    1. Create a scatterplot.
    2. Create a pairplot.
3. Use the Silhouette and Elbow methods to determine the optimal number of clusters for *k*-means clustering.
    1. Plot both methods and explain how you determine the number of clusters to use.
    2. Add titles and legends to the plot.
4. Evaluate the usefulness of at least three values for *k* based on insights from the Elbow and Silhouette methods.
    1. Plot the predicted *k*-means.
    2. Explain which value might give you the best clustering.
5. Fit a final model using your selected value for *k*.
    1. Justify your selection and comment on the respective cluster sizes of your final solution.
    2. Check the number of observations per predicted class.
6. Plot the clusters and interpret the model.

## 1. Load and explore the data

In [None]:
# Import necessary libraries.
import statsmodels.api as sm 
from statsmodels.formula.api import ols

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.metrics import accuracy_score
from scipy.spatial.distance import cdist

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Load the CSV file(s) as df2.
df2 = pd.read_csv('reviews_new.csv')

# View DataFrame.
print(df2)

In [None]:
# Keep only the columns needed for clustering.
df2 = df2[['remuneration', 'spending_score']]
df2.info()

In [None]:
# Explore the data.
print(df2.shape)
print(df2.columns)
print(df2.dtypes)
df2.head()

In [None]:
# Descriptive statistics.
df2.describe()

## 2. Plot

In [None]:
# Create a scatterplot with Seaborn.
sns.scatterplot(x='remuneration', y='spending_score', data=df2)

plt.title('Scatterplot')
plt.xlabel('Remuneration', fontsize=10)
plt.ylabel('Spending score', fontsize=10)

plt.savefig('Scatterplot.png')

In [None]:
# Create a pairplot with Seaborn.
x = df2[['remuneration', 'spending_score']]

sns.pairplot(df2,
             vars=x,
             diag_kind='kde')

plt.savefig('Pairplot.png')

## 3. Elbow and silhouette methods

In [None]:
# Determine the number of clusters: Elbow method.

# Import the KMeans class.
from sklearn.cluster import KMeans 

# Elbow chart to decide on the number of optimal clusters.
ss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i,
                    init='k-means++',
                    max_iter=300,
                    n_init=10,
                    random_state=0)
    kmeans.fit(x)
    ss.append(kmeans.inertia_)

# Plot the Elbow method.
plt.plot(range(1, 11),
         ss,
         marker='o')

# Insert labels and title.
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia (within-cluster sum of squares)')

plt.show()

In [None]:
# Determine the number of clusters: Silhouette method.

# Import the silhouette_score function from sklearn.
from sklearn.metrics import silhouette_score

# Find the range of clusters to be used using the Silhouette method.
sil = []
kmax = 10

for k in range(2, kmax+1):
    kmeans_s = KMeans(n_clusters=k, n_init=10, random_state=0).fit(x)
    labels = kmeans_s.labels_
    sil.append(silhouette_score(x,
                                labels,
                                metric='euclidean'))

# Plot the Silhouette method.
plt.plot(range(2, kmax+1),
         sil,
         marker='o')

# Insert labels and title.
plt.title('The Silhouette Method')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette score')

plt.show()

## 4. Evaluate k-means model at different values of *k*

In [None]:
# Use 5 clusters:
kmeans = KMeans(n_clusters = 5,
                max_iter = 15000,
                init='k-means++',
                random_state=42).fit(x)

clusters = kmeans.labels_
x['K-Means Predicted'] = clusters

# Plot the predicted.
sns.pairplot(x,
             hue='K-Means Predicted',
             diag_kind= 'kde')
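
Step 4 of the instructions asks for at least three values of *k* to be compared. A minimal sketch of such a comparison on synthetic data; the five blob centres, the `pts` array, and the candidate values 4, 5, and 6 are assumptions for illustration. On the real data, the two-column DataFrame would take the place of `pts`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the two features (assumption): five tight blobs.
centres = [(0, 0), (10, 10), (0, 10), (10, 0), (5, 5)]
pts, _ = make_blobs(n_samples=500, centers=centres,
                    cluster_std=0.5, random_state=0)

results = {}
for k in (4, 5, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pts)
    # Record the silhouette score and the size of each cluster.
    results[k] = (silhouette_score(pts, km.labels_),
                  np.bincount(km.labels_))

for k, (score, sizes) in results.items():
    print(k, round(score, 3), sizes.tolist())

# Pick the candidate with the highest silhouette score.
best_k = max(results, key=lambda k: results[k][0])
print('Best k:', best_k)
```

Looking at the cluster sizes next to the score helps to spot solutions where one segment is too small to be useful for marketing.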

## 5. Fit final model and justify your choice

In [None]:
# Check the number of observations per predicted class.
x['K-Means Predicted'].value_counts()

In [None]:
# View the K-Means predicted DataFrame.
print(x.head())

## 6. Plot and interpret the clusters

In [None]:
# Visualising the clusters.
# Set plot size.
sns.set(rc = {'figure.figsize':(12, 8)})

sns.scatterplot(x='spending_score' , 
                y ='remuneration',
                data=x , hue='K-Means Predicted',
                palette=['green', 'red', 'blue', 'black', 'orange'])

# Insert labels and title.
plt.title('Final Model')
plt.xlabel('Spending score')
plt.ylabel('Remuneration')

## 7. Discuss: Insights and observations

***Your observations here...***

# 

# Week 3 assignment: NLP using Python
Customer reviews were downloaded from the website of Turtle Games. This data will be used to steer the marketing department on how to approach future campaigns. Therefore, the marketing department asked you to identify the 15 most common words used in online product reviews. They also want to have a list of the top 20 positive and negative reviews received from the website. Therefore, you need to apply NLP on the data set.

## Instructions
1. Load and explore the data. 
    1. Sense-check the DataFrame.
    2. You only need to retain the `review` and `summary` columns.
    3. Determine if there are any missing values.
2. Prepare the data for NLP
    1. Change to lower case and join the elements in each of the columns respectively (`review` and `summary`).
    2. Replace punctuation in each of the columns respectively (`review` and `summary`).
    3. Drop duplicates in both columns (`review` and `summary`).
3. Tokenise and create wordclouds for the respective columns (separately).
    1. Create a copy of the DataFrame.
    2. Apply tokenisation on both columns.
    3. Create and plot a wordcloud image.
4. Frequency distribution and polarity.
    1. Create frequency distribution.
    2. Remove alphanumeric characters and stopwords.
    3. Create wordcloud without stopwords.
    4. Identify 15 most common words and polarity.
5. Review polarity and sentiment.
    1. Plot histograms of polarity (use 15 bins) for both columns.
    2. Review the sentiment scores for the respective columns.
6. Identify and print the top 20 positive and negative reviews and summaries respectively.
7. Include your insights and observations.

## 1. Load and explore the data

In [None]:
# Import all the necessary packages.
import nltk 
import os 

# Uncomment on first use to download the tokeniser models and the stopword list.
# nltk.download('punkt')
# nltk.download('stopwords')

from wordcloud import WordCloud
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.corpus import stopwords
!pip install textblob
from textblob import TextBlob
from scipy.stats import norm

# Import Counter.
from collections import Counter

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Load the data set as df3.
df3 = pd.read_csv('reviews_new.csv')

# View DataFrame.
print(df3)

In [None]:
# Explore data set.
print(df3.shape)
print(df3.columns)
print(df3.dtypes)
df3.head()

In [None]:
# Keep necessary columns. Drop unnecessary columns.
df3.drop(['gender', 'age', 'remuneration', 'spending_score', 'loyalty_points', 'education', 'product'], inplace=True, axis=1)
df3.info()

# View DataFrame.
print(df3)

In [None]:
# Determine if there are any missing values.
df3.isna().sum()

## 2. Prepare the data for NLP
### 2a) Change to lower case and join the elements in each of the columns respectively (review and summary)

In [None]:
# Review: Change all to lower case and join with a space.
df3['review'] = df3['review'].apply(lambda x: " ".join(x.lower() for x in x.split()))

# Preview the result.
df3['review'] .head()

In [None]:
# Summary: Change all to lower case and join with a space.
df3['summary'] = df3['summary'].apply(lambda x: " ".join(x.lower() for x in x.split()))

# Preview the result.
df3['summary'] .head()

### 2b) Replace punctuation in each of the columns respectively (review and summary)

In [None]:
# Remove all punctuation in the review column.
# regex=True is required for the pattern to be treated as a regular expression.
df3['review'] = df3['review'].str.replace(r'[^\w\s]', '', regex=True)

# View output.
df3['review'].head()

In [None]:
# Remove all punctuation in the summary column.
df3['summary'] = df3['summary'].str.replace(r'[^\w\s]', '', regex=True)

# View output.
df3['summary'].head()

### 2c) Drop duplicates in both columns

In [None]:
# Check review column for duplicates
df3.review.duplicated().sum()

In [None]:
# Drop duplicates.
df4 = df3.drop_duplicates(subset=['review'])

# Preview data.
df4 = df4.reset_index(drop=True)
df4.head()

In [None]:
# Check summary column for duplicates
df4.summary.duplicated().sum()

In [None]:
# Drop duplicates.
df5 = df4.drop_duplicates(subset=['summary'])

# Preview data.
df5 = df5.reset_index(drop=True)
df5.head()

In [None]:
df5.shape
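
The preparation steps in this section can also be chained into a single pass. A minimal sketch on a toy DataFrame; the `toy` data and the `clean_text` helper are hypothetical and only illustrate the order of operations used above.

```python
import pandas as pd

def clean_text(col: pd.Series) -> pd.Series:
    """Lower-case, collapse whitespace, and strip punctuation."""
    return (col.str.lower()
               .str.split().str.join(' ')
               .str.replace(r'[^\w\s]', '', regex=True))

# Toy data (assumption): the first two reviews differ only in case,
# spacing, and punctuation.
toy = pd.DataFrame({
    'review':  ['Great  game!', 'great game', 'Too hard...'],
    'summary': ['Five Stars', 'Loved it', 'Two stars'],
})

for name in ('review', 'summary'):
    toy[name] = clean_text(toy[name])

# Drop rows whose review or summary repeats an earlier row.
toy_clean = (toy.drop_duplicates(subset=['review'])
                .drop_duplicates(subset=['summary'])
                .reset_index(drop=True))
print(toy_clean)
```

Cleaning before dropping duplicates matters here: the first two toy reviews only become identical once case and punctuation are removed.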

## 3. Tokenise and create wordclouds

In [None]:
# Create new DataFrame (copy DataFrame).
df6 = df5.copy()

# View DataFrame.
df6

In [None]:
# String all the comments together in a single variable,
# separated by spaces so that words at the boundaries do not merge.
all_comments = ' '.join(df6['review'])

In [None]:
# Set the colour palette.
sns.set(color_codes=True)

# Create a WordCloud object.
word_cloud = WordCloud(width = 1600, height = 900, 
                background_color ='white',
                colormap = 'plasma', 
                stopwords = set(),  # Empty set: keep every word in this first cloud.
                min_font_size = 10).generate(all_comments) 

# Plot the WordCloud image.                    
plt.figure(figsize = (16, 9), facecolor = None) 
plt.imshow(word_cloud) 
plt.axis('off') 
plt.tight_layout(pad = 0) 
plt.show()

In [None]:
# Apply tokenisation to both columns.
# Tokenise the words.

df6['tokens'] = df6['review'].apply(word_tokenize)

# Preview data.
df6['tokens'].head()

In [None]:
# Apply tokenisation to both columns.
# Tokenise the words.

df6['tokens2'] = df6['summary'].apply(word_tokenize)

# Preview data.
df6['tokens2'].head()

In [None]:
# Define an empty list of tokens.
all_tokens = []

for i in range(df6.shape[0]):
    # Add each token to the list.
    all_tokens = all_tokens + df6['tokens2'][i]

## 4. Frequency distribution and polarity
### 4a) Create frequency distribution

In [None]:
# Import the FreqDist class.
from nltk.probability import FreqDist

# Calculate the frequency distribution.
fdist = FreqDist(all_tokens)

# Preview data.
fdist


### 4b) Remove alphanumeric characters and stopwords

In [None]:
# Keep only the alphanumeric tokens (drops punctuation-only tokens).
tokens = [word for word in all_tokens if word.isalnum()]

In [None]:
# Download the stopword list.
nltk.download ('stopwords')
from nltk.corpus import stopwords

# Create a set of English stopwords.
english_stopwords = set(stopwords.words('english'))

# Create a filtered list of tokens without stopwords.
tokens2 = [x for x in tokens if x.lower() not in english_stopwords]

# Define an empty string variable.
tokens2_string = ''

for value in tokens2:
    # Add each filtered token word to the string.
    tokens2_string = tokens2_string + value + ' '

### 4c) Create wordcloud without stopwords

In [None]:
# Create a wordcloud without stop words.
# Create a WordCloud.
wordcloud = WordCloud(width = 1600, height = 900, 
                background_color ='white', 
                colormap='plasma', 
                min_font_size = 10).generate(tokens2_string) 

# Plot the WordCloud image.                        
plt.figure(figsize = (16, 9), facecolor = None) 
plt.imshow(wordcloud) 
plt.axis('off') 
plt.tight_layout(pad = 0) 
plt.show()

### 4d) Identify 15 most common words and polarity

In [None]:
# Determine the 15 most common words.
# Import the class.
from nltk.probability import FreqDist

# Create a frequency distribution object.
freq_dist_of_words = FreqDist(tokens2)

# Show the fifteen most common elements in the data set.
freq_dist_of_words.most_common(15)
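
`Counter`, imported at the top of this section, offers an equivalent `most_common` method. A minimal sketch on a toy token list; `toy_tokens` and the tiny `toy_stopwords` set are assumptions standing in for the real tokens and the NLTK stopword list.

```python
from collections import Counter

# Toy tokens and stopwords (assumptions for illustration).
toy_tokens = ['the', 'game', 'is', 'fun', 'game', 'fun', 'game', '!', 'the']
toy_stopwords = {'the', 'is'}

# Keep alphanumeric tokens that are not stopwords.
filtered = [t for t in toy_tokens
            if t.isalnum() and t.lower() not in toy_stopwords]

# Count the tokens and list the most frequent ones.
top_words = Counter(filtered).most_common(2)
print(top_words)
```

On the full data set, the argument to `most_common` would be 15.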

## 5. Review polarity and sentiment: Plot histograms of polarity (use 15 bins) and sentiment scores for the respective columns.

In [None]:
# Import the prebuilt rules and values of the VADER lexicon.
nltk.download('vader_lexicon')

In [None]:
# Import the VADER class SentimentIntensityAnalyzer.
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Create a SentimentIntensityAnalyzer instance and store it in sia.
sia = SentimentIntensityAnalyzer()

In [None]:
# Run through a dictionary comprehension to take every cleaned, tokenised review and score its polarity.
df_polarity = {" ".join(_) : sia.polarity_scores(" ".join(_)) for _ in df6['tokens']}

In [None]:
# Convert the list of dictionary results to a DataFrame.
polarity = pd.DataFrame(df_polarity).T

# View the DataFrame.
polarity

In [None]:
sentiment_scores = df6['review'].apply(lambda x : sia.polarity_scores(x))
df6['polarity'] = sentiment_scores.apply(lambda x: x['compound'])

df6.head()

In [None]:
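# Review: Create a histogram plot of the polarity scores with bins = 15.
plt.hist(df6['polarity'], bins=15)

# Insert labels and title.
plt.title('Histogram of review sentiment score')
plt.xlabel('Polarity (compound score)')
plt.ylabel('Count')
plt.show()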

## 6. Identify top 20 positive and negative reviews and summaries respectively

In [None]:
# Top 20 negative reviews.


# View output.


In [None]:
# Top 20 negative summaries.


# View output.


In [None]:
# Top 20 positive reviews.


# View output.


In [None]:
# Top 20 positive summaries.


# View output.
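
One possible way to fill in the placeholder cells above, sketched on a toy DataFrame. The `scored` DataFrame and its polarity values are made up for illustration; on the real data, the `polarity` column would come from the sentiment scores computed in section 5, and `n` would be 20.

```python
import pandas as pd

# Toy data with made-up polarity values (assumption).
scored = pd.DataFrame({
    'review': ['loved it', 'broken on arrival', 'fine',
               'best ever', 'boring'],
    'polarity': [0.60, -0.48, 0.20, 0.64, -0.32],
})

n = 2  # Use 20 on the full data set.

# Most negative reviews: smallest polarity first.
top_negative = scored.nsmallest(n, 'polarity')

# Most positive reviews: largest polarity first.
top_positive = scored.nlargest(n, 'polarity')

print(top_negative)
print(top_positive)
```

The same two calls would apply to the summaries once a polarity score has been computed for that column as well.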


## 7. Discuss: Insights and observations

***Your observations here...***

# 