<a id='crisp-dm_framework'></a>
###  0. CRISP-DM Framework
***
*"CRISP-DM, which stands for Cross-Industry Standard Process for Data Mining, is an industry-proven way to guide your data mining efforts.*

*As a methodology, it includes descriptions of the typical phases of a project, the tasks involved with each phase, and an explanation of the relationships between these tasks.*

*As a process model, CRISP-DM provides an overview of the data mining life cycle."* (IBM, 2021, CRISP-DM Overview)

In [107]:
import warnings
from IPython.display import display, HTML  # Imported from IPython.display to avoid the deprecation warning

# Suppress deprecation warnings only for this block
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    display(HTML('<center><img src="https://www.ibm.com/docs/en/SS3RA7_18.1.0/modeler_crispdm_ddita/clementine/images/crisp_process.jpg" width=600 height=300 /></center>'))    

### Business / Research Understanding Phase

* **Define project requirements and objectives**
    * 
    

* **Translate objectives into data exploration problem definition**
    * High Level Primary objectives (not exhaustive)
        *        
        
* **Prepare preliminary strategy to meet objectives**
    * Data Required: 
    * Tools: Python via Jupyter for data preprocessing, analysis, visualisation and machine learning. MS Excel may be used where required.


### Data Understanding Phase
* **Collect data**
    * 
    * 
* **Perform exploratory data analysis (EDA)**

* **Assess data quality**
    * Check data for duplicate records, missing values and inconsistent data types.
* **Optionally, select interesting subsets**
    * Identify any particular areas or anomalies that require further investigation
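
The data quality checks listed above could be sketched in pandas roughly as follows. The small DataFrame and its column names are invented for illustration and are not the project's data.

```python
import pandas as pd

# Hypothetical sample data (not the project's dataset)
sample = pd.DataFrame({
    "route": ["46A", "46A", "15", None],
    "passengers": ["120", "120", "95", "80"],  # numbers stored as text
})

# Count fully duplicated rows
n_duplicates = int(sample.duplicated().sum())

# Count missing values per column
missing_per_column = sample.isna().sum()

print(n_duplicates)
print(missing_per_column)

# Inspect data types to spot columns stored with an unexpected type
print(sample.dtypes)

# Convert the text column to numeric; entries that cannot be parsed become NaN
sample["passengers"] = pd.to_numeric(sample["passengers"], errors="coerce")
print(sample.dtypes)
```

Printing the data types before and after the conversion makes it easy to confirm that the column now holds numbers rather than text.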

### Data Preparation Phase
* **Prepares for modelling in subsequent phases**
    * 
* **Select cases and variables appropriate for analysis**
    * Choose relevant columns and data subsets to include
* **Cleanse and prepare data so it is ready for modeling tools**
    * How to handle missing values
    * How to deal with data formats and standardisation

* **Perform transformation of certain variables, if needed**
    * Create new columns and variables if required

### Modelling Phase
* **Select and apply one or more modelling techniques**
* **Calibrate model settings to optimise results**
    * Adjust hyperparameters and validate the model using cross-validation.
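
To make the cross-validation step more concrete, here is a minimal, dependency-free sketch of how k-fold splitting partitions row indices into training and validation sets. The function name `k_fold_indices` is hypothetical; in practice scikit-learn's `KFold`, `cross_val_score` or `GridSearchCV` would normally handle this.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for each fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so fold sizes differ by at most one
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


# Split 10 rows into 3 folds; each row is used for validation exactly once
folds = list(k_fold_indices(n_samples=10, n_folds=3))
for train, validation in folds:
    print(len(train), validation)
```

A model would be fitted on each training set and scored on the matching validation set, and the scores averaged to compare hyperparameter settings.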

### Evaluation Phase
* **Evaluate one or more models for effectiveness**
    * Assess the model using appropriate metrics
    
* **Determine whether defined objectives achieved**
    * Validate whether the model(s) answer the research questions effectively
* **Make decision regarding data exploration results before deploying to field**
    * Analyse whether model(s) could be deployed in practice for this particular use case / problem statement
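
As an illustration of assessing a classifier with appropriate metrics, the sketch below computes accuracy, precision, recall and F1 from confusion-matrix counts. The counts and the function name `classification_metrics` are invented for the example; scikit-learn's `classification_report` reports the same quantities.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common binary classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts: 40 true positives, 10 false positives,
# 20 false negatives, 30 true negatives
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```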

### Deployment Phase
* **Make use of models created**
* **Simple deployment example: generate report**
* **Complex deployment example: implement parallel data exploration effort in another department**
* **In business settings, the customer often carries out deployment based on your model**
***

In [1]:
# Import EDA & visualisation libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.plotting import scatter_matrix

# Import statistics libraries from SciPy
from scipy import stats
from scipy.stats import norm
from scipy.stats import shapiro
from scipy.stats import pearsonr

# Import machine learning libraries from sklearn
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

# Display all columns of the pandas df
pd.set_option('display.max_columns', None)

# Configure default colour scheme for seaborn
sns.set(color_codes = True)

# Suppress warning messages
import warnings
warnings.filterwarnings('ignore')

In [None]:
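import json
from pyjstat import pyjstat

# Read a JSON-stat dataset (the source URL is left blank here)
dataset = pyjstat.Dataset.read('')

# Convert the dataset to a pandas df and save it as JSON
df = dataset.write('dataframe')
df.to_json('dataframe_to_json.json')

# Quick check of JSON serialisation with the standard library
dictionary = {"a": 3, "b": 27}
json.dumps([1, 2, 3])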

### Twitter API Extraction

In [111]:
import os
import tweepy

# Read credentials from environment variables instead of hardcoding them in the notebook
# (the environment variable names below are placeholders)
consumer_key = os.environ["TWITTER_CONSUMER_KEY"]  # Your API/Consumer key
consumer_secret = os.environ["TWITTER_CONSUMER_SECRET"]  # Your API/Consumer Secret Key
access_token = os.environ["TWITTER_ACCESS_TOKEN"]  # Your Access token key
access_token_secret = os.environ["TWITTER_ACCESS_TOKEN_SECRET"]  # Your Access token Secret key

# Pass in our twitter API authentication key
auth = tweepy.OAuth1UserHandler(
    consumer_key, consumer_secret,
    access_token, access_token_secret
)

# Instantiate the tweepy API
api = tweepy.API(auth, wait_on_rate_limit=True)


search_query = "'dublin' 'bus'-filter:retweets AND -filter:replies AND -filter:links"
no_of_tweets = 1

try:
    # Retrieve up to no_of_tweets tweets matching the search query
    tweets = api.search_tweets(q=search_query, lang="en", count=no_of_tweets, tweet_mode='extended')

    # Pull selected attributes from each tweet
    attributes_container = [[tweet.user.name, tweet.created_at, tweet.favorite_count, tweet.source, tweet.full_text] for tweet in tweets]

    # Column names for the dataframe
    columns = ["User", "Date Created", "Number of Likes", "Source of Tweet", "Tweet"]

    # Create the dataframe
    tweets_df = pd.DataFrame(attributes_container, columns=columns)
except tweepy.TweepyException as e:
    # Catch tweepy errors only, rather than every BaseException
    print('Status Failed On,', str(e))

Status Failed On, 403 Forbidden
453 - You currently have access to a subset of Twitter API v2 endpoints and limited v1.1 endpoints (e.g. media post, oauth) only. If you need access to this endpoint, you may need a different access level. You can learn more here: https://developer.twitter.com/en/portal/product


I initially created a Twitter developer account so that I could access the Twitter API and pull tweets for sentiment analysis using 'tweepy', an easy-to-use Python library for accessing the Twitter API.

https://developer.twitter.com/en/portal/dashboard
<br>https://www.tweepy.org/

Following the YouTube tutorial below, I used the 'tweepy' Python package and the code above to try to pull tweets containing the text 'dublin bus', setting 'no_of_tweets = 1' to test that my connection was set up correctly. However, Twitter has since changed its free account usage so that users can no longer pull tweets for free, and the error message below is returned:

Error Message:

Status Failed On, 403 Forbidden
453 - You currently have access to a subset of Twitter API v2 endpoints and limited v1.1 endpoints (e.g. media post, oauth) only. If you need access to this endpoint, you may need a different access level. You can learn more here: https://developer.twitter.com/en/portal/product

Tutorials where code was leveraged:

https://www.youtube.com/watch?v=fHHDM2-If9g
<br>https://github.com/analyticswithadam/Python/blob/main/Pulling_Tweets.ipynb

### Reddit API Extraction

In [74]:
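import os
import praw

# Read your own credentials from environment variables instead of hardcoding them
# (the environment variable names below are placeholders)
client_id = os.environ["REDDIT_CLIENT_ID"]
client_secret = os.environ["REDDIT_CLIENT_SECRET"]
user_agent = "web_scrapper_api"
username = os.environ["REDDIT_USERNAME"]
password = os.environ["REDDIT_PASSWORD"]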

# Create a Reddit instance
reddit = praw.Reddit(
    client_id=client_id,
    client_secret=client_secret,
    user_agent=user_agent,
    username=username,
    password=password)

In [113]:
# Specify search query for Reddit API
query = 'public transport'

# Create a list to store data
posts = []

# Search for the query in posts
# ('all' is a time_filter value in PRAW rather than a sort order, so sort by relevance across all time)
for submission in reddit.subreddit('Ireland').search(query, sort='relevance', time_filter='all', limit=None):
    posts.append([submission.title, submission.score, submission.id, submission.subreddit, submission.url,
                  submission.num_comments, submission.selftext, submission.created])

# Create a pandas df with API query results and column headers
columns = ['title', 'score', 'id', 'subreddit', 'url', 'num_comments', 'body', 'created']
df = pd.DataFrame(posts, columns = columns)

# Convert the 'created' column timestamp to a readable format
df['created'] = pd.to_datetime(df['created'], unit='s')

df.head()

#df.to_csv('reddit_comments.csv', index=False)

Unnamed: 0,title,score,id,subreddit,url,num_comments,body,created
0,Eamon Ryan: Free public transport would 'incre...,908,11ksg57,ireland,https://www.irishexaminer.com/news/arid-410867...,705,,2023-03-07 07:18:55
1,Dublin is worst capital in Europe for public t...,1335,1379naz,ireland,https://www.irishtimes.com/environment/climate...,277,Shame they didn't mention speed to get in and ...,2023-05-04 04:16:19
2,Oddest experience you’ve ever had on Irish pub...,423,127gzo8,ireland,https://www.reddit.com/r/ireland/comments/127g...,468,What’s the weirdest thing you’ve ever gone thr...,2023-03-31 10:35:40
3,Poll: Would you support a subsidised €9 monthl...,287,15y2344,ireland,https://www.thejournal.ie/poll-your-say-2-6148...,186,,2023-08-22 10:44:13
4,'Too many cars' on our roads hampering public ...,172,185t7vv,ireland,https://www.thejournal.ie/dublin-bus-transport...,272,,2023-11-28 10:43:01


Like Twitter, Reddit is a large social network made up of communities where people gather to discuss various topics.

I created an application in my Reddit account called 'web_scraper_api' and, using the 'praw' (Python Reddit API Wrapper) library, pulled posts matching my query criteria, e.g. posts containing the words 'public transport' in the subreddit 'Ireland', so that the results are relevant.

https://praw.readthedocs.io/en/stable/index.html


Tutorial where code was leveraged:

https://www.youtube.com/watch?v=gIZJQmX-55U&ab_channel=PyMoondra

In [88]:
import nltk

# Download the lexicon
nltk.download("vader_lexicon")

# Import the lexicon 
from nltk.sentiment.vader import SentimentIntensityAnalyzer

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\shass\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [95]:
# Create an instance of SentimentIntensityAnalyzer
sentiment_analyser = SentimentIntensityAnalyzer()

# Example
sentence = "VADER is shit at identifying the underlying sentiment of a text!"
print(sentiment_analyser.polarity_scores(sentence))

{'neg': 0.302, 'neu': 0.698, 'pos': 0.0, 'compound': -0.5983}


In [96]:
# Define a function to get the compound sentiment score
def get_sentiment_score(text):
    return sentiment_analyser.polarity_scores(text)['compound']

# Apply the function to the 'title' column
df['sentiment_score'] = df['title'].apply(get_sentiment_score)

In [97]:
df.head()

Unnamed: 0,title,score,id,subreddit,url,num_comments,body,created,sentiment_score
0,Dublin is worst capital in Europe for public t...,1337,1379naz,ireland,https://www.irishtimes.com/environment/climate...,277,Shame they didn't mention speed to get in and ...,2023-05-04 04:16:19,-0.6249
1,€25 billion Dublin transport plan to be published,80,10k17sk,ireland,https://www.irishtimes.com/ireland/dublin/2023...,110,,2023-01-24 08:54:56,0.0
2,Revised Dublin transport plan sees costs doubl...,383,qq0r0m,ireland,https://www.irishtimes.com/news/environment/re...,201,,2021-11-09 10:22:36,-0.2263
3,'Too many cars' on our roads hampering public ...,173,185t7vv,ireland,https://www.thejournal.ie/dublin-bus-transport...,272,,2023-11-28 10:43:01,0.0
4,Public transport in Dublin: some general thoughts,150,15x63ju,ireland,https://www.reddit.com/r/ireland/comments/15x6...,89,I'm back in Ireland for the summer after a lon...,2023-08-21 12:20:40,0.0


In [101]:
def format_output(compound_score):
    # Map a VADER compound score to a polarity label using +/-0.05 thresholds
    polarity = "neutral"
    if compound_score >= 0.05:
        polarity = "positive"
    elif compound_score <= -0.05:
        polarity = "negative"
    return polarity

# Label each title's compound score as positive, negative or neutral
df['polarity'] = df['sentiment_score'].apply(format_output)

In [102]:
df.head()

Unnamed: 0,title,score,id,subreddit,url,num_comments,body,created,sentiment_score,polarity
0,Dublin is worst capital in Europe for public t...,1337,1379naz,ireland,https://www.irishtimes.com/environment/climate...,277,Shame they didn't mention speed to get in and ...,2023-05-04 04:16:19,-0.6249,negative
1,€25 billion Dublin transport plan to be published,80,10k17sk,ireland,https://www.irishtimes.com/ireland/dublin/2023...,110,,2023-01-24 08:54:56,0.0,neutral
2,Revised Dublin transport plan sees costs doubl...,383,qq0r0m,ireland,https://www.irishtimes.com/news/environment/re...,201,,2021-11-09 10:22:36,-0.2263,negative
3,'Too many cars' on our roads hampering public ...,173,185t7vv,ireland,https://www.thejournal.ie/dublin-bus-transport...,272,,2023-11-28 10:43:01,0.0,neutral
4,Public transport in Dublin: some general thoughts,150,15x63ju,ireland,https://www.reddit.com/r/ireland/comments/15x6...,89,I'm back in Ireland for the summer after a lon...,2023-08-21 12:20:40,0.0,neutral


In [104]:
# Group by the 'polarity' column and count the number of rows for each category
polarity_counts = df.groupby('polarity').size()

print(polarity_counts)

polarity
negative     44
neutral     135
positive     52
dtype: int64
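
Expressed as shares of all titles scored, the counts above work out as follows. This is a small plain-Python sketch that hardcodes the printed counts rather than reading them from `df`.

```python
# Counts copied from the groupby output above
polarity_counts = {"negative": 44, "neutral": 135, "positive": 52}

total = sum(polarity_counts.values())

# Percentage share of each polarity label, rounded to one decimal place
shares = {
    label: round(100 * count / total, 1)
    for label, count in polarity_counts.items()
}

print(total)   # 231
print(shares)  # {'negative': 19.0, 'neutral': 58.4, 'positive': 22.5}
```

More than half of the titles are scored as neutral, with positive titles slightly outnumbering negative ones.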


In [105]:
import pandas as pd
import requests

# Define the URL where the JSON data is available
url = "https://data.ssb.no/api/v0/en/table/11347/"

# Make a GET request to fetch the raw JSON content
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the JSON content and convert it into a pandas DataFrame
    data = response.json()
    df = pd.json_normalize(data)
else:
    df = pd.DataFrame()  # Empty DataFrame if the request failed

df.head()  # Display the first few rows of the DataFrame

Unnamed: 0,title,variables
0,11347: Public transport. Ticket revenues and p...,"[{'code': 'TransportA', 'text': 'mode of trans..."
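
The response above contains only the table's metadata (its title and a list of variables), not the figures themselves. Statistics Norway's API appears to follow the PxWeb convention, in which the data is requested by POSTing a JSON query to the same URL. The sketch below builds such a query body from a metadata dictionary. The helper name `build_query`, the sample metadata values and the choice of `json-stat2` as the response format are assumptions for illustration and should be checked against the API documentation.

```python
import json


def build_query(metadata, max_values=None):
    """Build a PxWeb-style query body selecting values for every variable."""
    query = []
    for variable in metadata["variables"]:
        values = variable["values"]
        # Optionally limit how many values are requested per variable
        if max_values is not None:
            values = values[:max_values]
        query.append({
            "code": variable["code"],
            "selection": {"filter": "item", "values": values},
        })
    return {"query": query, "response": {"format": "json-stat2"}}


# Hypothetical metadata in the same shape as the response above
# (the codes and values are placeholders, not the real table contents)
sample_metadata = {
    "title": "example table",
    "variables": [
        {"code": "TransportA", "values": ["a", "b", "c"]},
        {"code": "Tid", "values": ["2021", "2022"]},
    ],
}

payload = build_query(sample_metadata, max_values=2)
print(json.dumps(payload, indent=2))
```

The payload could then be sent with `requests.post(url, json=payload)`, and a JSON-stat response read with a library such as `pyjstat`.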
