<a href="https://colab.research.google.com/github/ygebre1/bitcoin-price-predictor/blob/trial1/sentiments_live.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Sentiment Analysis
Sentiment analysis, also known as opinion mining, is the process of using natural language processing (NLP), machine learning, and text analysis to determine the emotional tone behind a body of text. It is commonly used to analyze opinions, attitudes, and emotions expressed in written language.

###Mwclient
[mwclient](https://mwclient.readthedocs.io/en/latest/) is a Python library used to interact with MediaWiki-based websites, such as Wikipedia and Wikidata. It provides an API client that allows users to fetch, edit, and manage content on MediaWiki-powered platforms programmatically.

In [1]:
!pip install mwclient

Collecting mwclient
  Downloading mwclient-0.11.0-py3-none-any.whl.metadata (3.7 kB)
Downloading mwclient-0.11.0-py3-none-any.whl (33 kB)
Installing collected packages: mwclient
Successfully installed mwclient-0.11.0


In [2]:
import mwclient
import time

site = mwclient.Site('en.wikipedia.org')
page = site.pages['Bitcoin']

In [3]:
revs = list(page.revisions())



In [4]:
revs[0]



OrderedDict([('revid', 1272296704),
             ('parentid', 1272293121),
             ('minor', ''),
             ('user', 'JivanP'),
             ('timestamp',
              time.struct_time(tm_year=2025, tm_mon=1, tm_mday=28, tm_hour=0, tm_min=11, tm_sec=9, tm_wday=1, tm_yday=28, tm_isdst=-1)),
             ('comment', '/* Mining */ Edit some language for clarity')])

In [5]:
revs = sorted(revs, key=lambda rev: rev['timestamp'])



In [6]:
revs[0]



OrderedDict([('revid', 275832581),
             ('parentid', 0),
             ('user', 'Pratyeka'),
             ('timestamp',
              time.struct_time(tm_year=2009, tm_mon=3, tm_mday=8, tm_hour=16, tm_min=41, tm_sec=7, tm_wday=6, tm_yday=67, tm_isdst=-1)),
             ('comment', 'creation (stub)')])
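
Sorting on `rev['timestamp']` works because `time.struct_time` values compare field by field (year, month, day, and so on), which gives chronological order. The following is a minimal sketch of that behavior; the `sample_revs` list and its values are hypothetical stand-ins for real mwclient revisions, not data from the Bitcoin page.

```python
import time

# Hypothetical stand-ins for mwclient revisions: each has a struct_time timestamp.
fmt = "%Y-%m-%d %H:%M:%S"
sample_revs = [
    {"revid": 3, "timestamp": time.strptime("2025-01-28 00:11:09", fmt)},
    {"revid": 1, "timestamp": time.strptime("2009-03-08 16:41:07", fmt)},
    {"revid": 2, "timestamp": time.strptime("2009-08-05 10:00:00", fmt)},
]

# struct_time compares like a tuple (year, month, day, ...),
# so sorting by it yields oldest-first order.
ordered = sorted(sample_revs, key=lambda rev: rev["timestamp"])

# Convert the oldest timestamp to a date string, as done later in the notebook.
oldest_date = time.strftime("%Y-%m-%d", ordered[0]["timestamp"])
print([rev["revid"] for rev in ordered], oldest_date)
```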

###Transformers Library
The Transformers library, created by Hugging Face, provides a ready-made `sentiment-analysis` pipeline. The default model used below (DistilBERT fine-tuned on SST-2) labels text as either positive or negative and returns a confidence score; it has no neutral class.

In [7]:
from transformers import pipeline
sentiment_pipeline = pipeline("sentiment-analysis")

def find_sentiment(text):
    sent = sentiment_pipeline([text[:250]])[0]
    score = sent['score']
    if sent['label'] == 'NEGATIVE':
        score *= -1
    return score

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


In [8]:
find_sentiment("I love it.")



0.9998767375946045

In [9]:
find_sentiment("I hate it.")



-0.9996691942214966
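
The sign flip inside `find_sentiment` can be looked at in isolation. This is a minimal sketch that assumes each pipeline result is a dict with `label` and `score` keys, as in the function above; `signed_score` is a hypothetical helper name and the example values are made up for illustration, so no model is loaded here.

```python
def signed_score(result):
    """Map a {'label', 'score'} dict to a value in [-1, 1]."""
    score = result["score"]
    # NEGATIVE results become negative numbers; anything else stays positive.
    return -score if result["label"] == "NEGATIVE" else score

# Hypothetical pipeline-style results (values invented for illustration).
examples = [
    {"label": "POSITIVE", "score": 0.99},
    {"label": "NEGATIVE", "score": 0.75},
]
print([signed_score(r) for r in examples])  # [0.99, -0.75]
```

Note that `text[:250]` in `find_sentiment` truncates by characters, not tokens, so only the first 250 characters of each comment are scored.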

##Use ThreadPoolExecutor for Parallel Sentiment Analysis
* time: Used for timestamp conversion.
* defaultdict (from collections): Creates a dictionary with default values to avoid unnecessary key checks.
* ThreadPoolExecutor (from concurrent.futures): Enables parallel execution to speed up sentiment analysis.
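
As a small illustration of the `defaultdict` point, the sketch below groups a few hypothetical `(date, sentiment)` pairs by date without checking whether the key already exists. The `daily` and `pairs` names and all values are assumptions for illustration only.

```python
from collections import defaultdict

# Each missing key is created on first access with this default record.
daily = defaultdict(lambda: {"sentiments": [], "edit_count": 0})

# Hypothetical (date, sentiment) pairs.
pairs = [("2009-03-08", -0.9), ("2009-03-08", 0.4), ("2009-08-05", 0.7)]

for date, sentiment in pairs:
    daily[date]["edit_count"] += 1          # no "if date not in daily" needed
    daily[date]["sentiments"].append(sentiment)

print(dict(daily))
```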

##Why is the ThreadPoolExecutor Approach Faster?
* Parallel Processing with ThreadPoolExecutor
    * Sequential Approach (My Initial Approach):
    ```python
    for rev in revs:
        sentiment = find_sentiment(rev.get("comment", ""))  # Runs one at a time (slow)
    ```
        * Each call to `find_sentiment()` blocks, so revisions are processed one at a time.
        * If `find_sentiment()` takes 0.5 seconds per revision and there are 100,000 revisions, the total execution time is:
            * 100,000 × 0.5 sec = 50,000 sec (~13.9 hours)
    * ThreadPoolExecutor Approach (Revised Approach):
    ```python
    with ThreadPoolExecutor(max_workers=8) as executor:
        results = list(executor.map(process_revision, revs))
    ```
        * Processes up to 8 revisions concurrently, one per worker thread.
        * With the same 0.5 seconds per revision and 100,000 revisions, the best-case execution time is:
            * (100,000 / 8) × 0.5 sec = 6,250 sec (~1.7 hours)
* In the ideal case, parallel execution cuts the runtime by a factor of up to 8. The actual speedup depends on how much of the model inference can run concurrently across threads, so it may be smaller.
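
The estimate above can be checked on a small scale. The sketch below is an assumption-laden stand-in: `slow_task` is a hypothetical function that sleeps for 0.05 seconds instead of running the model. Sleeping releases the interpreter lock, so this shows the ideal case; real CPU-bound inference will likely see a smaller gain.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_task(x):
    """Hypothetical stand-in for find_sentiment: sleeps to simulate a blocking call."""
    time.sleep(0.05)
    return x * 2

items = list(range(16))

# Sequential: one call after another.
start = time.perf_counter()
sequential = [slow_task(x) for x in items]
sequential_time = time.perf_counter() - start

# Threaded: up to 8 calls in flight at once; map() preserves input order.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as executor:
    parallel = list(executor.map(slow_task, items))
parallel_time = time.perf_counter() - start

print(f"sequential: {sequential_time:.2f}s, threaded: {parallel_time:.2f}s")
```

Because `executor.map` returns results in the same order as its input, the threaded results can be matched back to revisions exactly as in the sequential loop.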

In [10]:
import time
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Dictionary to store results efficiently
edits = defaultdict(lambda: {"sentiments": [], "edit_count": 0})

# Function to process each revision
def process_revision(rev):
    # Convert struct_time to date string
    date = time.strftime('%Y-%m-%d', rev['timestamp'])

    # Get comment safely
    comment = rev.get("comment", "")

    # Run sentiment analysis (can be slow)
    sentiment = find_sentiment(comment)

    return date, sentiment

# Use ThreadPoolExecutor for parallel execution (better for Colab)
with ThreadPoolExecutor(max_workers=8) as executor:  # Adjust workers if needed
    results = list(executor.map(process_revision, revs))

# Populate the edits dictionary from the parallel results
for date, sentiment in results:
    edits[date]["edit_count"] += 1
    edits[date]["sentiments"].append(sentiment)

print("Processing complete!")



Processing complete!


In [11]:
from statistics import mean

for key in edits:
    if len(edits[key]['sentiments']) > 0:
        edits[key]['sentiment'] = mean(edits[key]['sentiments'])
        edits[key]['neg_sentiment'] = len([s for s in edits[key]['sentiments'] if s < 0]) / len(edits[key]['sentiments'])
    else:
        edits[key]['sentiment'] = 0
        edits[key]['neg_sentiment'] = 0

    del edits[key]['sentiments']
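
To make the two daily metrics concrete, here is a worked example on a single hypothetical day. The scores are invented for illustration; `sentiment` is the mean of the signed scores and `neg_sentiment` is the share of edits with a negative score.

```python
from statistics import mean

# Hypothetical signed sentiment scores for one day's edits.
sentiments = [0.9, -0.8, -0.7, -0.6]

daily_mean = mean(sentiments)                                  # average signed score
neg_share = sum(s < 0 for s in sentiments) / len(sentiments)   # fraction below zero

print(round(daily_mean, 2), neg_share)
```

Here three of the four scores are negative, so the negative share is 0.75 and the mean is about -0.3.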



In [12]:
edits



defaultdict(<function __main__.<lambda>()>,
            {'2009-03-08': {'edit_count': 4,
              'sentiment': -0.5505250692367554,
              'neg_sentiment': 0.75},
             '2009-08-05': {'edit_count': 1,
              'sentiment': 0.7481208443641663,
              'neg_sentiment': 0.0},
             '2009-08-06': {'edit_count': 2,
              'sentiment': 0.9957457184791565,
              'neg_sentiment': 0.0},
             '2009-08-14': {'edit_count': 1,
              'sentiment': 0.930020809173584,
              'neg_sentiment': 0.0},
             '2009-10-13': {'edit_count': 2,
              'sentiment': -0.22750061750411987,
              'neg_sentiment': 0.5},
             '2009-11-18': {'edit_count': 1,
              'sentiment': 0.8839507699012756,
              'neg_sentiment': 0.0},
             '2009-12-08': {'edit_count': 1,
              'sentiment': -0.9869275689125061,
              'neg_sentiment': 1.0},
             '2009-12-17': {'edit_count': 1,
    

In [13]:
import pandas as pd
edits_df = pd.DataFrame.from_dict(edits, orient='index')



In [14]:
edits_df



| | edit_count | sentiment | neg_sentiment |
|---|---|---|---|
| 2009-03-08 | 4 | -0.550525 | 0.750000 |
| 2009-08-05 | 1 | 0.748121 | 0.000000 |
| 2009-08-06 | 2 | 0.995746 | 0.000000 |
| 2009-08-14 | 1 | 0.930021 | 0.000000 |
| 2009-10-13 | 2 | -0.227501 | 0.500000 |
| ... | ... | ... | ... |
| 2025-01-18 | 2 | -0.001296 | 0.500000 |
| 2025-01-19 | 3 | -0.325052 | 0.666667 |
| 2025-01-26 | 4 | -0.995555 | 1.000000 |
| 2025-01-27 | 3 | -0.991851 | 1.000000 |


In [15]:
edits_df.index = pd.to_datetime(edits_df.index)



In [16]:
from datetime import datetime

dates = pd.date_range(start="2009-03-08",end=datetime.today())



In [17]:
dates



DatetimeIndex(['2009-03-08', '2009-03-09', '2009-03-10', '2009-03-11',
               '2009-03-12', '2009-03-13', '2009-03-14', '2009-03-15',
               '2009-03-16', '2009-03-17',
               ...
               '2025-02-02', '2025-02-03', '2025-02-04', '2025-02-05',
               '2025-02-06', '2025-02-07', '2025-02-08', '2025-02-09',
               '2025-02-10', '2025-02-11'],
              dtype='datetime64[ns]', length=5820, freq='D')