<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">


# DSI-SG-42 Project 3: Web APIs & NLP
### Reddit Scams: Are We Vulnerable?
---

In this project, we build a classification model that predicts whether a post comes from r/RandomKindness or r/Scams. The goal is to give subreddit moderators a proactive tool for identifying and mitigating potential scam posts in their communities. By analyzing sentiment and the most frequently used words in each subreddit, we develop a basic understanding of the linguistic red flags that might signal a scam. With machine learning, we aim to help moderators quickly distinguish genuine acts of kindness from fraudulent activity, fostering safer and more trustworthy online environments. Our study aims to answer the following problem statement.

### **Problem Statement:**

How Might We Assist Reddit Moderators in Differentiating Online Scams from Legitimate Acts of Kindness within Reddit Posts Using a Sensitive & Accurate Natural Language Processing Model?


# 1. Web Scraping from Reddit

<h2>1.1 Importing Libraries</h2>

In [4]:
# Imports for Web Scraping
import pandas as pd
import re
import datetime
import requests

import praw
from praw.models import MoreComments
from praw.exceptions import APIException

import time
import itertools

import string
import csv

In [5]:
# Initialize the Reddit API client
# Replace the placeholders with your own Reddit API credentials
reddit = praw.Reddit(
    client_id = 'YOUR_CLIENT_ID',
    client_secret = 'YOUR_CLIENT_SECRET',
    user_agent = 'YOUR_USER_AGENT'
)
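
Credentials typed directly into a notebook are easy to leak when the notebook is shared. The following is a minimal sketch of one alternative, assuming the credentials are stored in environment variables. The variable names (`REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, `REDDIT_USER_AGENT`) and the helper `load_reddit_credentials` are hypothetical and not part of PRAW.

```python
import os

# Hypothetical environment variable names; use whatever fits your setup.
REQUIRED_VARS = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET", "REDDIT_USER_AGENT")


def load_reddit_credentials(env=None):
    """Return keyword arguments for praw.Reddit, read from the environment."""
    env = os.environ if env is None else env
    # Fail early with a clear message if any variable is unset or empty
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
        "user_agent": env["REDDIT_USER_AGENT"],
    }
```

With the variables set, the client could then be created with `reddit = praw.Reddit(**load_reddit_credentials())`.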

<h2>1.2 Creating Functions to Get Posts from Subreddits</h2>

<h3>Creating Function to Get Posts from r/RandomKindness</h3>

In [6]:
def get_posts_with_comments(subreddit_name):
    """Collect titles, bodies, comments, and replies from a subreddit's newest posts."""
    data = []  # One dict per title, body, comment, or reply

    try:
        # Request up to 1,000 of the newest posts
        posts = reddit.subreddit(subreddit_name).new(limit=1000)

        for post in posts:
            time.sleep(1)  # Pause for 1 second between posts

            # Store the post ID and URL so every entry can be traced back to its post
            post_id = post.id
            post_url = post.url

            # Append the post title, with newlines replaced by spaces
            data.append({
                'subreddit': subreddit_name,
                'type': 'title',
                'text': post.title.replace('\n', ' '),
                'id': post_id,
                'url': post_url
            })

            # Append the post body
            data.append({
                'subreddit': subreddit_name,
                'type': 'body',
                'text': post.selftext,
                'id': post_id,
                'url': post_url
            })

            # Drop unexpanded "more comments" placeholders to manage the load
            post.comments.replace_more(limit=0)

            # Iterate over top-level comments only: post.comments.list() would also
            # return nested replies, which the inner loop already collects
            for comment in post.comments:
                data.append({
                    'subreddit': subreddit_name,
                    'type': 'comment',
                    'text': comment.body.replace('\n', ' '),
                    'id': post_id,
                    'url': post_url
                })

                # Append every nested reply under this comment
                for reply in comment.replies.list():
                    data.append({
                        'subreddit': subreddit_name,
                        'type': 'reply',
                        'text': reply.body.replace('\n', ' '),
                        'id': post_id,
                        'url': post_url
                    })
    except APIException as e:
        # Report the error and keep whatever was collected before it occurred
        print(f"API Exception: {e}")

    return data

# Fetch data from r/RandomKindness
data_randomkindness = get_posts_with_comments('RandomKindness')
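
Each comment thread is a tree: top-level comments carry replies, which may carry further replies. The sketch below uses a hypothetical `FakeComment` stand-in (not a PRAW class) to illustrate one way of walking such a tree so that every item is recorded exactly once, with top-level items labelled `comment` and nested items labelled `reply`. It makes no network calls, and the breadth-first `flatten` helper is an assumption about how a flattened listing behaves, not PRAW's own code.

```python
from dataclasses import dataclass, field


@dataclass
class FakeComment:
    """Stand-in for a Reddit comment; not a PRAW class."""
    body: str
    replies: list = field(default_factory=list)


def flatten(comments):
    """Return every comment in the tree, breadth-first."""
    flat, queue = [], list(comments)
    while queue:
        comment = queue.pop(0)
        flat.append(comment)
        queue.extend(comment.replies)
    return flat


def label_rows(top_level):
    """Tag top-level comments as 'comment' and everything nested below as 'reply'."""
    rows = []
    for comment in top_level:
        rows.append(("comment", comment.body))
        for reply in flatten(comment.replies):
            rows.append(("reply", reply.body))
    return rows


# Hand-made thread: A has replies A1 and A2, A1 has reply A1a, B has none
thread = [
    FakeComment("A", [FakeComment("A1", [FakeComment("A1a")]), FakeComment("A2")]),
    FakeComment("B"),
]
rows = label_rows(thread)
print(rows)
```

Flattening the whole thread first and then walking each item's replies again would record nested replies more than once, since `flatten(thread)` already returns all five items.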

In [12]:
# Preview the first 5 records from r/RandomKindness without re-scraping,
# so the full dataset collected above is not overwritten
data_randomkindness[:5]

[{'subreddit': 'RandomKindness',
  'type': 'title',
  'text': '[RK] An Introduction and Rules Reminder to r/RandomKindness',
  'id': '1bmbrr8',
  'url': 'https://www.reddit.com/r/RandomKindness/comments/1bmbrr8/rk_an_introduction_and_rules_reminder_to/'},
 {'subreddit': 'RandomKindness',
  'type': 'body',
  'text': "# If you're new (or not so new) to [/r/RandomKindness](https://www.reddit.com/r/RandomKindness), welcome!\n\nTo participate in our sub, please keep the following requirements and rules in mind:\n\n## Account Requirements\n\n## To make requests and take up offers:\n\n* Your account MUST be at least 180 days old\n* Have at least 500 *COMMENT* Karma\n* Have recent productive (non-spam, non-request, non-gift exchange) activity for the past 90 days to post a request or enter to receive on offers\n\nPlease note, we're asking for *COMMENT* Karma. Not combined or total karma, which you normally see listed.\n\nYour Comment Karma is listed on the desktop when you mouse hover over you

<h3>Creating Function to Get Posts from r/Scams</h3>

In [7]:
# get_posts_with_comments takes the subreddit name as a parameter,
# so the function defined above for r/RandomKindness is reused unchanged

# Fetch data from r/Scams
data_scams = get_posts_with_comments('Scams')

In [11]:
# Preview the first 5 records from r/Scams without re-scraping,
# so the full dataset collected above is not overwritten
data_scams[:5]

[{'subreddit': 'Scams',
  'type': 'title',
  'text': 'This is a weird scam',
  'id': '1bnwy7r',
  'url': 'https://www.reddit.com/r/Scams/comments/1bnwy7r/this_is_a_weird_scam/'},
 {'subreddit': 'Scams',
  'type': 'body',
  'text': 'A random person friended me on discord and said hello, I said Hi and then I said "Why did you friend me?" he ignored the question and said "Nothing Much" then asked if he could ask me a question I said sure he then asked me  "If $3000 was deposited to your account what will you spend it on (1) food (2) child (3) bills (4) family (5) car (6) investment (7) your needs (8)rent be honest" I made a new answer number (9) fun because I am a minor and then asked if he was a scammer and he skipped over my question and said " I’m sorry for the late response, Good, What paying app did you have Cash app or PayPal?" I was confused because he said good for no reason and then it hit me. It was an automated message scam. I said Cashapp and he told me to drop the link. I tol

<h2>1.3 Combine Both Data & Convert into a Dataframe</h2>

In [14]:
# Combine and Convert to Dataframe
combined_data = data_randomkindness + data_scams
df = pd.DataFrame(combined_data)

# Export to CSV
df.to_csv('../data/reddit_scraped_data.csv', index=False)
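
Before modelling, it can help to check how many records each subreddit contributed and how they split across titles, bodies, comments, and replies. The sketch below is one way to do that on the list of dictionaries before it becomes a DataFrame. The helper `summarize_records` is hypothetical, and the sample rows are hand-made placeholders in the same shape as the scraped records, not real data.

```python
from collections import Counter


def summarize_records(records):
    """Count records per (subreddit, type) pair."""
    return Counter((row["subreddit"], row["type"]) for row in records)


# Hand-made placeholder rows in the same shape as the scraped records
sample = [
    {"subreddit": "RandomKindness", "type": "title", "text": "t1", "id": "x1", "url": "u1"},
    {"subreddit": "RandomKindness", "type": "body", "text": "b1", "id": "x1", "url": "u1"},
    {"subreddit": "Scams", "type": "title", "text": "t2", "id": "x2", "url": "u2"},
    {"subreddit": "Scams", "type": "comment", "text": "c1", "id": "x2", "url": "u2"},
    {"subreddit": "Scams", "type": "comment", "text": "c2", "id": "x2", "url": "u2"},
]

counts = summarize_records(sample)
for (subreddit, kind), n in sorted(counts.items()):
    print(subreddit, kind, n)
```

Calling `summarize_records(combined_data)` on the real records would show at a glance whether one subreddit dominates the dataset.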