After deciding to focus on my GitHub portfolio, I made a mini-project to demonstrate the skills we learned this week: web scraping (BeautifulSoup and MongoDB), natural language processing (NLP), time series, and clustering.

So, I decided to have some fun with Craigslist's Missed Connections to practice:
+ web scraping
+ database management
+ data visualization
+ NLP
+ clustering

I didn't do any time series analysis since that would require a pretty healthy dataset to play with, and my script is only a week old... 

To start off, let's talk about scraping Craigslist. I didn't think Missed Connections would be too difficult to scrape because I assumed it had relatively little spam compared to other Craigslist categories. This was my first time building a scraper, and I learned a ton in the process:

1. **Get used to iterating.** I'm not sure how long I thought it would take to scrape and curate my own dataset, but it definitely took longer than I expected. The process seemed like it would be simple: build a scraping script, build a database to store info, then add some functionality to the scraper to retrieve only new posts that aren't in the database. Each step required lots of iterating, and of course, there's always room for improvement and more iterations...
   
2. **Building a "friendly" scraper.** To avoid overloading Craigslist's server (and getting banned for it), I put a delay between page requests. I used the `sleep` function from Python's `time` module to emulate a person spending 10-13 seconds on each page with `sleep(10 + 3*np.random.random())`, where `np` is NumPy. It might be a bit on the conservative side, but I've managed to store 1153 posts so far without getting kicked off!

3. **Figuring out your database needs.** Even though we had learned about MongoDB this week, it felt like overkill for my needs. I wasn't going to be storing much, and my table schema was pretty straightforward, so I went with Postgres. My friend shared some helpful resources to help me make these decisions again in the future: 

    [When to Use MongoDB Rather than MySQL (or Other RDBMS): The Billing Example](https://dzone.com/articles/when-use-mongodb-rather-mysql)<br><br>
    
    <figure><img src='http://i.stack.imgur.com/iMkdg.png'><figcaption><i><a href='http://i.stack.imgur.com/iMkdg.png'>Visual Guide to NoSQL Systems</a></i></figcaption></figure>
    
    I also learned that Python's `sqlalchemy` module is much nicer to use than `psycopg2`. Good god. Goodbye rollbacks and commits!
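
The delay from item 2 can be sketched without any scraping library at all. This is only an illustration, not the code from my repo: `polite_delay` and `crawl` are hypothetical names, `fetch` stands in for whatever actually downloads a page, and I use the standard library's `random` instead of NumPy to keep the snippet self-contained.

```python
import random
import time


def polite_delay(base=10.0, jitter=3.0):
    """Return a random pause between `base` and `base + jitter` seconds."""
    return base + jitter * random.random()


def crawl(urls, fetch, sleep=time.sleep):
    """Fetch each URL in turn, pausing between consecutive requests.

    `fetch` is any callable that takes a URL and returns the page content.
    It is passed in so the sketch does not depend on a specific HTTP library.
    """
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(polite_delay())  # wait before every request after the first
        pages.append(fetch(url))
    return pages
```

Passing `sleep` as an argument makes it easy to try the loop without actually waiting 10+ seconds per page, for example by handing in `lambda seconds: None`.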

***

I'll give a small demonstration of my project, but you can check out my code here: [https://github.com/stong1108/CL_missedconn](https://github.com/stong1108/CL_missedconn)

First, importing things:

In [1]:
from MissedConn import MissedConn
from maps import *
from manage_db import *

Next, we create a MissedConn object. I'm initializing it for just the 'sfc' subcategory of sfbay's missed connections to keep scraping to a minimum during this demonstration.

In [None]:
mc = MissedConn('https://sfbay.craigslist.org/search/sfc/mis')
df = mc.get_df()

Then, we can make an interactive Folium Map to visualize and read the Missed Connection posts that included latitude & longitude data (posts that contain maps).

I haven't decided if/how to represent the remaining posts: some include a neighborhood, and some either don't or provide a location that's not super helpful. There's another function `make_heat_map(df)` in `maps.py` for making heat maps that is pretty similar to `make_pinned_map(df)` for now. The pinned map is more fun to me (and I'm especially proud of the markers).

Click on the pins to read the posts!

In [3]:
m = make_pinned_map(df)
m

After playing with this Missed Connections set, we can upload it to our Postgres database with the `update_db(df)` function from `manage_db.py`. We can also grab a DataFrame of everything in our database to play with using `db_to_df()`. 

In [None]:
update_db(df)
df_all = db_to_df()
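
For anyone curious about the "only store new posts" idea, here is a rough sketch of one way to do it. It is not what `update_db` does internally: I'm using the standard library's `sqlite3` as a stand-in for Postgres so the snippet runs anywhere, and the `posts` table, its columns, and the `update_posts` name are all made up for illustration. (In Postgres, the analogous statement would be `INSERT ... ON CONFLICT DO NOTHING`.)

```python
import sqlite3


def update_posts(conn, posts):
    """Insert posts whose URL is not already stored; return how many were added."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts (url TEXT PRIMARY KEY, title TEXT, body TEXT)"
    )
    before = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
    # INSERT OR IGNORE skips rows that would violate the primary key on url
    conn.executemany(
        "INSERT OR IGNORE INTO posts (url, title, body) VALUES (?, ?, ?)",
        [(p["url"], p["title"], p["body"]) for p in posts],
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
    return after - before
```

Using the post URL as the primary key lets the database decide what counts as "new", so the scraper doesn't have to check each post itself.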

Now we can move on to NLP and clustering!

First, we tokenize the words in our posts and vectorize each post based on its word content using TF-IDF. TF-IDF stands for Term Frequency-Inverse Document Frequency. I found this nomenclature a little annoying: "frequency" was misleading to me since it makes me think about dividing by some cycle length. It's more like Term Count-Inverse Document Count. You count how many times a word has occurred in a specific document, then scale that count down according to how many documents contain the word (usually via the logarithm of total documents divided by documents containing the word). This tells you how important a word is within a specific post compared to all posts.
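
To make the counting concrete, here is a bare-bones TF-IDF in plain Python. It's a sketch of the textbook formula (raw count times `log(N / df)`) with naive whitespace tokenization, not the vectorizer my project uses; real libraries typically add smoothing and normalization on top. The `tfidf` function name is just for this example.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Uses tf = raw count in the document and idf = log(N / df), where df is
    the number of documents that contain the term.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in counts.items()})
    return vectors
```

With this version, a word that shows up in every post gets a weight of zero, since `log(N / N)` is 0. That's the "inverse document" part doing its job.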

With vectorized posts, we can calculate which posts are most similar to each other by cosine similarity. Since each post is represented as a vector, we can compare them pairwise to find which two are most "aligned". The cosine similarities range from 0 to 1, since we have no negative values in our vectors (term frequency and document frequency can't be negative).
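
Here is a small sketch of that pairwise comparison, assuming each post is stored as a sparse `{term: weight}` dict. The helper names `cosine_similarity` and `most_similar_pair` are mine for this example and are not the internals of `print_most_cos_sim`.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def most_similar_pair(vectors):
    """Return (i, j, similarity) for the pair with the highest cosine similarity."""
    return max(
        ((i, j, cosine_similarity(vectors[i], vectors[j]))
         for i, j in combinations(range(len(vectors)), 2)),
        key=lambda triple: triple[2],
    )
```

Because cosine similarity only looks at direction, a short post and a long post that use the same words in the same proportions still come out as a perfect match.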

In [None]:
print_most_cos_sim(df_all)

Heh. I think my function works. There is a lot of "missing" going on in these two posts. 

Lastly, I used k-means clustering to try to group similar posts together and looked at what words were the most representative of each cluster. 
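
For reference, the core of k-means fits in a few lines. This is a generic sketch of Lloyd's algorithm on small dense vectors, written in plain Python for readability. It is not the clustering code from my repo, and the `kmeans` function with its `n_iter` and `seed` parameters is purely illustrative.

```python
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Cluster dense vectors (lists of numbers) with Lloyd's algorithm.

    Returns (labels, centroids). Initial centroids are k distinct points
    chosen at random, so results can vary with the seed.
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid
        new_labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids will too
        labels = new_labels
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids
```

The "most representative words" for a cluster would then be the dimensions (terms) with the largest values in that cluster's centroid.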

In [None]:
print_clustered_words(df_all, 5)

Meh. I looked at the actual posts too, this time with a smaller number of clusters to keep the output digestible.

In [None]:
print_clustered_posts(df_all, 3)

My clustering results weren't that interesting: the underlying themes for each cluster weren't clear to me. Looks like I need 1) a larger corpus to work with, and 2) a different vectorizing method that doesn't strictly compare individual words. I'll explore word2vec to look at word embeddings, since it can capture relationships between words that are used in similar contexts.