# **Problem Statement**

[`Rivian`](https://rivian.com/), a leading electric vehicle manufacturer in the United States, has hired me to perform a sentiment analysis of two `subreddits` - `r/cars` and `r/electricvehicles` - which it actively monitors on the popular Reddit platform. The problem is that the data collected before I was hired were not properly labelled, so we do not know which subreddit each of those posts came from. We do not wish to discard the unlabelled data because there is a substantial amount of it and, due to some issues with [Pushshift's](https://github.com/pushshift/api) API, it is no longer possible to get data from earlier than November. My boss has therefore asked me to do the following:

1. Collect more data from the two subreddits, using [Pushshift's](https://github.com/pushshift/api) API.
2. Use Natural Language Processing to train a classifier that predicts which subreddit a given post came from. This is a binary classification problem and I will try several different models.
3. In future, I can then use my best model to classify the unlabelled posts we have as belonging to either the `r/cars` or `r/electricvehicles` subreddit, and then carry out a sentiment analysis. (This third point is not covered in this project.)
4. Make a presentation to him and the rest of the team outlining my process and findings.

## Metrics for Evaluation

1. **Accuracy of at least 0.90**. Accuracy is a key metric for this project. Ideally, we do not want the model to misclassify more than 1 in 10 posts; otherwise we would not be confident of the model's usefulness.
2. **F1 score of at least 0.90**. The F1 score is important for two reasons. First, we are as interested in correctly predicting posts from the `r/cars` subreddit as we are in correctly predicting posts from the `r/electricvehicles` subreddit. Neither is more important to us. Second, it turned out that our two classes were slightly unbalanced with the `r/cars` subreddit having the majority - 58% - of observations. As a result of this imbalance, accuracy is not a perfect metric. For these reasons, a strong F1 score is useful.
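
To make the two thresholds concrete, here is a minimal sketch of how accuracy and a per-class F1 score can be computed from true and predicted labels. The toy labels, the helper names `accuracy` and `f1_for_class`, and the choice to average the two per-class F1 scores (a macro average) are assumptions for illustration; the project does not state which F1 variant is used.

```python
def accuracy(y_true, y_pred):
    """Share of posts whose predicted subreddit matches the true one."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_for_class(y_true, y_pred, positive):
    """F1 score when `positive` is treated as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (made up for illustration): 6 cars posts and 4 EV posts.
y_true = ["cars"] * 6 + ["electricvehicles"] * 4
y_pred = ["cars"] * 5 + ["electricvehicles"] * 4 + ["cars"]

acc = accuracy(y_true, y_pred)
f1_cars = f1_for_class(y_true, y_pred, "cars")
f1_ev = f1_for_class(y_true, y_pred, "electricvehicles")

# Macro average: both subreddits count equally, whatever their size.
macro_f1 = (f1_cars + f1_ev) / 2
```

In this toy case the F1 score of the smaller class is lower than the accuracy, which is the kind of gap that accuracy alone would hide on unbalanced classes.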


## **Data Collection**

My target is to gather 5,000 posts from each of the subreddits, making a total of 10,000 posts. My thinking is that, since the two subreddits are fairly similar in theory (both deal with cars), my models will need a substantial amount of data to be able to tell their posts apart.

The features to be collected are:
1. `subreddit` - which will be either `cars` or `electricvehicles`.
2. `title` of the post.
3. `id`.
4. `created_utc` column. This is useful in the data collection stage since there is a limit on the number of posts I can pull using the Pushshift API at any given time. I'll use the `created_utc` column to find the time of the oldest post pulled so far, so that when I want to collect more data from a particular subreddit, I can pass that time to the API as the cut-off for the next request.

Any other column that would be used in this classification problem would be engineered from the columns above. 
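
Since `created_utc` is a Unix timestamp (seconds since 1970-01-01 UTC), it can help to convert it to a readable date when checking how far back a pull reaches. The sketch below uses only the standard library; the helper name `utc_to_datetime` is hypothetical, and the two timestamps are taken from the first and last rows of the first `r/cars` pull shown later in this notebook.

```python
from datetime import datetime, timezone


def utc_to_datetime(created_utc):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc)


# Newest and oldest posts in the first r/cars pull
newest = utc_to_datetime(1673635931)
oldest = utc_to_datetime(1673055257)

# Time span covered by one request of 1000 posts
span = newest - oldest
print(newest.isoformat(), oldest.isoformat(), span.days)
```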

## Let's begin with `r/cars`

In [1]:
#imports 
import requests
import calendar
import time
import pandas as pd

In [2]:
#Getting first set of data - set the limit to 1000
url = 'https://api.pushshift.io/reddit/search/submission'
params = {
    'subreddit': 'cars',
    'limit': 1000,
    'filter': 'subreddit, title, created_utc, id'
}

In [3]:
res = requests.get(url, params)
res.status_code

200

In [4]:
res.json()['data']

[{'subreddit': 'cars',
  'title': 'Need help for an article',
  'id': '10b28x0',
  'created_utc': 1673635931},
 {'subreddit': 'cars',
  'title': 'We have a 2014 Nissan Rogue. Just found out about the class action lawsuit from last year. How can I still get benefits from it/can I still claim it? Thank you',
  'id': '10b1jn0',
  'created_utc': 1673634254},
 {'subreddit': 'cars',
  'title': 'Audi manual transmission?',
  'id': '10b14rq',
  'created_utc': 1673633236},
 {'subreddit': 'cars',
  'title': 'How much would a paint job cost to cover up two stripes on a car ?',
  'id': '10b1112',
  'created_utc': 1673632979},
 {'subreddit': 'cars',
  'title': 'What car is Jeezy riding in this music video? at the :40 sec mark',
  'id': '10b0zwu',
  'created_utc': 1673632905},
 {'subreddit': 'cars',
  'title': "This A2Air Android Auto Wireless Adapter $119.99 - What's the Catch?",
  'id': '10b0z2c',
  'created_utc': 1673632854},
 {'subreddit': 'cars',
  'title': 'Can a double din deck from a 2012 Fo

In [6]:
car_1 = pd.DataFrame(res.json()['data'])
car_1.tail()

Unnamed: 0,subreddit,title,id,created_utc
995,cars,Alfa Romeo's Secret Rotary Engine Project | Ro...,105cb83,1673055885
996,cars,Here’s something every car person and non car ...,105cb2y,1673055874
997,cars,Haven’t paid speeding ticket will it pop when ...,105c6ik,1673055526
998,cars,When will people start accepting that most of ...,105c3lq,1673055309
999,cars,what r we thinkin,105c2yp,1673055257


## Looping to obtain more data

In the cell below, I do a few things:
1. I have introduced a new parameter in `params` called `until`. It is set to the earliest timestamp in the `created_utc` column (that is, the oldest post collected so far), so that the API returns 1000 more data points created "until" that time.
2. I create an empty list called `cars_df`. This list will hold the dataframes produced by the `for loop` that follows.
3. I loop over `range(0,4)`. Each iteration requests 1000 datapoints from the API, converts the JSON response into a dataframe and appends it to the `cars_df` list.
4. Finally, at the end of each iteration, the `until` parameter is updated with the smallest value in the `created_utc` column of the dataframe just collected, as explained above.

In [8]:
url = 'https://api.pushshift.io/reddit/search/submission'
params = {
    'subreddit': 'cars',
    'limit': 1000,
    'filter': 'subreddit, title, created_utc, id',
    'until': 1673055257
}


cars_df = []
for i in range(0,4):
    res = requests.get(url, params)
    cars_df.append(pd.DataFrame(res.json()['data']))
    
    #the `until` parameter needs to be moved back to the oldest post collected so far
    params['until'] = cars_df[i]['created_utc'].min()
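
The same pagination pattern can be sketched as a small reusable function. This is an illustration under stated assumptions, not the code that was run: `collect_pages` and `fake_fetch` are hypothetical names, and `fake_fetch` is an in-memory stand-in for the API that assumes `until` returns only posts strictly older than the given timestamp. Stopping on an empty page is an addition that guards against the API running out of older posts.

```python
def collect_pages(fetch_page, n_pages, until=None):
    """Fetch up to n_pages pages, moving `until` back to the oldest post seen."""
    posts = []
    for _ in range(n_pages):
        page = fetch_page(until)
        if not page:
            # No older posts were returned, so stop early.
            break
        posts.extend(page)
        # The next request should end at the oldest post in this page.
        until = min(post["created_utc"] for post in page)
    return posts


def fake_fetch(until, page_size=3):
    """Hypothetical stand-in for the API: ten posts, newest first."""
    all_posts = [{"id": f"p{t}", "created_utc": t} for t in range(10, 0, -1)]
    older = [p for p in all_posts if until is None or p["created_utc"] < until]
    return older[:page_size]


posts = collect_pages(fake_fetch, n_pages=5)
print(len(posts), posts[-1]["created_utc"])
```

Passing the fetch step in as a function keeps the loop testable without a network call; in the notebook, that step is the `requests.get(url, params)` call followed by `res.json()['data']`.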
    

In [9]:
#this is what my list of DataFrames looks like
cars_df

[    subreddit                                              title       id  \
 0        cars                  Family of 4 (+3) New Car &lt;$50k  105c1c3   
 1        cars                            What car should I get ?  105bclx   
 2        cars  91 vs 95 gas for a 2023 suzuki vitara 4x2 1600cc?  105bbsp   
 3        cars                                        I need help  105b79k   
 4        cars                               Best song on horizon  105b4mp   
 ..        ...                                                ...      ...   
 995      cars  What You Need to Know About the Vehicle Kill S...  100hmo7   
 996      cars                                    Don’t read this  100hli1   
 997      cars  DJay Pro 3.1.5 Crack With Serial Key Free Down...  100hh7s   
 998      cars  ByteFence 5.7.1.1 Crack Plus Registration Code...  100hfbh   
 999      cars  DriverDoc 5.3.521 Crack + License Keygen Free ...  100hcxj   
 
      created_utc  
 0     1673055137  
 1     1673053294  
 2

In [10]:
#My cars_df is a list of dataframes. I will now concatenate into one large dataframe
cars_df = pd.concat(cars_df)

In [11]:
#my cars_df is now a single DataFrame as opposed to a list of DataFrames - good
type(cars_df)

pandas.core.frame.DataFrame

In [12]:
#Now I will concat this new dataframe with my first dataframe - `car_1` which I obtained before doing the loop
#Concatenating it all into a dataframe called `cars`

cars = pd.concat([car_1, cars_df], ignore_index=True)
cars

Unnamed: 0,subreddit,title,id,created_utc
0,cars,Need help for an article,10b28x0,1673635931
1,cars,We have a 2014 Nissan Rogue. Just found out ab...,10b1jn0,1673634254
2,cars,Audi manual transmission?,10b14rq,1673633236
3,cars,How much would a paint job cost to cover up tw...,10b1112,1673632979
4,cars,What car is Jeezy riding in this music video? ...,10b0zwu,1673632905
...,...,...,...,...
4995,cars,Full list of performance modifications on my 2...,zhh85q,1670641717
4996,cars,Car community / Dodge,zhh4qn,1670641424
4997,cars,I don’t have a car but I want to learn to driv...,zhgvzx,1670640736
4998,cars,2001 ford expedition traction control,zhgmpt,1670640001


As expected, our `cars` dataset contains 5000 observations.

### Now that we have that sorted, we can move on to data collection for our second subreddit - `r/electricvehicles` 

We are going to go through the same process as before, namely:
1. Request 1000 posts from the Pushshift API.
2. Loop to get more datapoints, changing the `until` parameter for each iteration.

In [21]:
url = 'https://api.pushshift.io/reddit/search/submission'
ps = {
    'subreddit': 'electricvehicles',
    'limit': 1000,
    'filter': 'subreddit, title, created_utc, id'
}

In [22]:
response = requests.get(url, ps)
response.status_code

200

In [23]:
response.json()['data']

[{'subreddit': 'electricvehicles',
  'title': "CNBC visits Chile's Albemarle Salar Plant, one of only two lithium mines in the country",
  'id': '10b28kt',
  'created_utc': 1673635909},
 {'subreddit': 'electricvehicles',
  'title': 'Hackers Gained Access to California’s Digital License Plates',
  'id': '10b1moi',
  'created_utc': 1673634458},
 {'subreddit': 'electricvehicles',
  'title': 'Is $7,500 federal credit still twice in a lifetime thing?',
  'id': '10b1iu5',
  'created_utc': 1673634199},
 {'subreddit': 'electricvehicles',
  'title': 'These guys upgraded the battery on a model x!',
  'id': '10b1dwq',
  'created_utc': 1673633865},
 {'subreddit': 'electricvehicles',
  'title': 'Tesla inviting people to hack into its cars for $600,000 Pwn2Own prize',
  'id': '10b133p',
  'created_utc': 1673633121},
 {'subreddit': 'electricvehicles',
  'title': 'The ‘Hand of God reaches the future electric all-wheel drive BMW M',
  'id': '10b0n5w',
  'created_utc': 1673632052},
 {'subreddit': 'elect

In [25]:
ev_1 = pd.DataFrame(response.json()['data'])
ev_1.tail()

Unnamed: 0,subreddit,title,id,created_utc
995,electricvehicles,EVgo will be down tonight starting at 3 am ET ...,zvvua6,1672085948
996,electricvehicles,EVgo will be down early tomorrow morning for m...,zvvrp6,1672085752
997,electricvehicles,[USA] EVgo will be down for maintenance early ...,zvvo64,1672085485
998,electricvehicles,Planned EVgo Outage,zvvmfp,1672085359
999,electricvehicles,Any buzz on NA ID buzz?,zvvl9v,1672085270


In [27]:
url = 'https://api.pushshift.io/reddit/search/submission'
params = {
    'subreddit': 'electricvehicles',
    'limit': 1000,
    'filter': 'subreddit, title, created_utc, id',
    'until': 1672085270
}


ev_df = []
for i in range(0,3):
    response = requests.get(url, params)
    ev_df.append(pd.DataFrame(response.json()['data']))
    
    #the `until` parameter needs to be moved back to the oldest post collected so far
    params['until'] = ev_df[i]['created_utc'].min()
    

In [28]:
#My list of DataFrames
ev_df

[            subreddit                                              title  \
 0    electricvehicles  Sono Motors received more than 1,000 full depo...   
 1    electricvehicles  Progress report: When our Dual-Motor AWD syste...   
 2    electricvehicles  How do Non Tesla EVs think about expansion eve...   
 3    electricvehicles             Percentage of electric cars in the US:   
 4    electricvehicles                                 eSprinter web site   
 ..                ...                                                ...   
 995  electricvehicles  Farmer Boys near me has dedicated EV parking s...   
 996  electricvehicles  You're Being Lied to About Electric Cars. Scie...   
 997  electricvehicles  Tennessee considers tripling fee for owning el...   
 998  electricvehicles  BYD Seal Review — Could this be one of the fir...   
 999  electricvehicles  If a person presses the gas pedal on an electr...   
 
          id  created_utc  
 0    zvuwzl   1672083461  
 1    zvuk24   167

In [29]:
#concatenating into one dataframe
ev_df = pd.concat(ev_df, ignore_index=True)
ev_df

Unnamed: 0,subreddit,title,id,created_utc
0,electricvehicles,"Sono Motors received more than 1,000 full depo...",zvuwzl,1672083461
1,electricvehicles,Progress report: When our Dual-Motor AWD syste...,zvuk24,1672082488
2,electricvehicles,How do Non Tesla EVs think about expansion eve...,zvt8fn,1672078984
3,electricvehicles,Percentage of electric cars in the US:,zvscco,1672076637
4,electricvehicles,eSprinter web site,zvs6ts,1672076215
...,...,...,...,...
2655,electricvehicles,China EV Sales: BYD Achieves All-Electric Mile...,ylfi28,1667510667
2656,electricvehicles,BMW i3 - considering buying a 2015,yles4u,1667509199
2657,electricvehicles,DRAG RACING OUR GOLF CARTS! They say electric ...,ylcrgv,1667504982
2658,electricvehicles,Ceer Is Saudi Arabia's First Homegrown EV Bran...,ylcims,1667504451


As you can see, the loop returned 2660 rows rather than the 3000 hoped for. This is because the Pushshift API does not go any further back at this time, so this appears to be all the data available. We are therefore left with an unbalanced dataset in which the samples from the `cars` dataframe outnumber those from the `electricvehicles` dataframe. I could remove some rows from the `cars` dataframe to even things out, but I do not think that is ideal, so I am going to proceed with the unbalanced classes. However, I will stratify when splitting into training and test sets, and I will keep the class imbalance in mind when interpreting the `accuracy` of my models.
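
To illustrate what stratifying means here, the sketch below splits each class separately so that the training and test sets keep the same share of `cars` posts. It is a standard-library illustration only: the function name `stratified_split`, the 20% test size and the seed are assumptions, and in practice scikit-learn's `train_test_split` offers a `stratify` argument for the same purpose.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), splitting every class in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Class sizes from this notebook: 5000 cars posts and 3660 EV posts.
labels = ["cars"] * 5000 + ["electricvehicles"] * 3660
train_idx, test_idx = stratified_split(labels)

# Share of cars posts in each split; both should match the overall share.
train_share = sum(labels[i] == "cars" for i in train_idx) / len(train_idx)
test_share = sum(labels[i] == "cars" for i in test_idx) / len(test_idx)
print(round(train_share, 3), round(test_share, 3))
```

The share of `cars` posts is also the accuracy a model would reach by always predicting `cars`, which is the baseline any classifier needs to beat.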

Moving on, below I concatenate my two ev dataframes into one. This new dataframe is expected to have `1000+2660 = 3660` rows.

In [30]:
ev = pd.concat([ev_1, ev_df], ignore_index=True)
ev.shape

(3660, 4)

In [31]:
ev.tail()

Unnamed: 0,subreddit,title,id,created_utc
3655,electricvehicles,China EV Sales: BYD Achieves All-Electric Mile...,ylfi28,1667510667
3656,electricvehicles,BMW i3 - considering buying a 2015,yles4u,1667509199
3657,electricvehicles,DRAG RACING OUR GOLF CARTS! They say electric ...,ylcrgv,1667504982
3658,electricvehicles,Ceer Is Saudi Arabia's First Homegrown EV Bran...,ylcims,1667504451
3659,electricvehicles,Quantumscape Announces New Cell Format for Sol...,ylchfm,1667504385


### Putting it all together

Below I do three things:
- Concatenate the two subreddit dataframes into one, then
- Drop the `created_utc` column. Its sole purpose was to aid data collection from the API, so I no longer need it, then
- Drop duplicate rows from the combined dataframe.

In [32]:
ev_and_cars = pd.concat([cars, ev], axis=0, ignore_index=True).drop(columns=['created_utc']).drop_duplicates(ignore_index=True)
print(ev_and_cars.shape)
print(ev_and_cars.head())
print('-'*10)
ev_and_cars.tail()

(8660, 3)
  subreddit                                              title       id
0      cars                           Need help for an article  10b28x0
1      cars  We have a 2014 Nissan Rogue. Just found out ab...  10b1jn0
2      cars                          Audi manual transmission?  10b14rq
3      cars  How much would a paint job cost to cover up tw...  10b1112
4      cars  What car is Jeezy riding in this music video? ...  10b0zwu
----------


Unnamed: 0,subreddit,title,id
8655,electricvehicles,China EV Sales: BYD Achieves All-Electric Mile...,ylfi28
8656,electricvehicles,BMW i3 - considering buying a 2015,yles4u
8657,electricvehicles,DRAG RACING OUR GOLF CARTS! They say electric ...,ylcrgv
8658,electricvehicles,Ceer Is Saudi Arabia's First Homegrown EV Bran...,ylcims
8659,electricvehicles,Quantumscape Announces New Cell Format for Sol...,ylchfm


Apparently there were no duplicate rows in my dataset, so I still end up with 8660 rows. That's good.
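
A slightly stricter check is to deduplicate on the post `id` alone, which would also catch the same post appearing twice with a different title or subreddit. The sketch below uses a small made-up dataframe and assumes that `id` uniquely identifies a post; the names `toy`, `deduped` and `n_removed` are hypothetical.

```python
import pandas as pd

# Toy data: the first two rows share an id, as they would if one page
# of API results overlapped with the next.
toy = pd.DataFrame({
    "subreddit": ["cars", "cars", "electricvehicles", "electricvehicles"],
    "title": [
        "Audi manual transmission?",
        "Audi manual transmission?",
        "Planned EVgo Outage",
        "Any buzz on NA ID buzz?",
    ],
    "id": ["10b14rq", "10b14rq", "zvvmfp", "zvvl9v"],
})

# Keep the first occurrence of each post id and renumber the index.
deduped = toy.drop_duplicates(subset="id", ignore_index=True)
n_removed = len(toy) - len(deduped)
print(n_removed)
```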

### Saving of dataframes as csv files to be used for my analysis

In [33]:
#Saving the unclean cars data
cars.to_csv('../data/cars_unclean.csv', index=False)

In [34]:
#Saving the unclean ev data
ev.to_csv('../data/ev_unclean.csv', index=False)

In [35]:
#Saving the combined unclean data
ev_and_cars.to_csv('../data/ev_cars_unclean.csv', index=False)