# Cleaned Parsing Functions

In [1]:
# Import necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup
import re
import dateutil.parser
import datetime
import json


## Overview

Our goal is to extract relevant information on every article tagged with the `Codingbootcamp` tag.  
We accomplish this in three steps:
1. Iterate through each page of the archive for this tag.
2. On each date page (such as 2018/01/02), extract the urls for all links that correspond to articles published on that date tagged with `Codingbootcamp` (don't extract "Home", "signup" links etc).
3. On each url, call the single article parsing function to extract author, author_bio, title, publish date, publisher, and article text.

## Get HTML Function
**Base Source**: https://realpython.com/python-web-scraping-practical-introduction/  
**Update10/1/18**: This code originated from this tutorial but was later adapted in order complete step 3 - to get all articles tagged with `Codingbootcamp`. 
I added a second get html function that checked whether the web browser redirected you at some point when attempting to access the html at the provided link. This was necessary because when you attempted to access a page of the archive that didn't exist - such as attempting to access `archive/2013/03/02`, you are instead redirected back to the year or month page. In those situations, I didn't want to re-parse the articles on the main page, which would have been redundant parsing (and extra time in an already long function, and instead didn't parse a page that had been redirected.  
I had to keep this in a seperate function, because for certain single article urls, like "https://medium.freecodecamp.org/5-key-learnings-from-the-post-bootcamp-job-search-9a07468d2331" were redirected in the retreival process (I am not sure why). I attempted to come up with a elegant solution using response object attributes that would differentiate between article redirects and archive redirects, but was unable to do so. 
So the function that retrieves the html for articles doesn't check for redirects, but the function that retrieves archive page html does check for redirects. 

In [2]:
# Define get function to get raw HTML
def simple_get_article(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the content-type of response is some kind of HTML/XML, return the
    text content, otherwise return None.
    """
    try:
        #closing ensures any network resources are freed when out of scope - good practice
        with closing(get(url, stream=True)) as resp: 
            if is_good_response(resp):
                return resp.content
            else:
                return None

    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None

# Define get function to get raw HTML from the archive pages, with a twist.
def simple_get_archive(url):
    """
    Retrieves the raw html of a page in the `Codingbootcamp` tag archive (https://medium.com/tag/codingbootcamp/archive).
    But some pages of the archive don't exist (like 2013/01/02) because no stories were published on that date. 
    If you attempt to access these nonexistent urls, you are redirected to the main year or month page.
    We don't want to re-parse those pages (redundant code), so our hack-y solution is to check to see if the HTML response
    object has a redirect (302 status code) in its history. 
    This is defined as a separate function, because some of the article urls are redirected (I am not sure why), and I couldn't 
    come up with a clean solution that separates article redirection vs archive redirection. 
    """
    try:
        #closing ensures any network resources are freed when out of scope - good practice
        with closing(get(url, stream=True)) as resp: 
            if is_redirect(resp):
                return "redirect"
            elif is_good_response(resp):
                return resp.content
            else:
                return None

    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None
    
# I added this helper function to check if the HTML response returned by the browser had been redirected at some point
# See http://docs.python-requests.org/en/v0.10.6/api/ for docs on history attribute
def is_redirect(resp):
    """
    Returns True if the resp had been redirected at some point in the retrieval process due to nonexistent url, 
    False otherwise. 
    Arguments: a HTML response object
    """
    resp_history = resp.history
    if resp_history: #not empty - some things happend before response was returned
        # I specifically want to check for redirects (status code 302)
        statuses = [h.status_code==302 for h in resp_history]
        # Are there any true in the above list comp? Then something was redirected. 
        return np.any(statuses)
    else:
        return False
    
def is_good_response(resp):
    """
    Returns True if the response seems to be HTML, False otherwise.
    """
    content_type = resp.headers['Content-Type'].lower()
    return (resp.status_code == 200 
            and content_type is not None 
            and content_type.find('html') > -1)


def log_error(e):
    """
    It is always a good idea to log errors. 
    This function just prints them, but you can
    make it do anything.
    """
    print(e)

## Single Article Info Parsing

The following functions return information parsed from a single article on Medium. 

In [3]:
def get_author(parsed_html):
    """Parses the author name from a Medium article. 
    Arguments:
        parsed_html: object returned by calling `BeautifulSoup(raw_html, 'html.parser')`
    Returns: 
        the author name as a string or None if author tag not present
    """
    author = parsed_html.find('meta', property="author")
    return author['content'] if author else None

def get_author_bio(parsed_html):
    """Parses the author's bio/description from a Medium article. 
    Arguments:
        parsed_html: object returned by calling `BeautifulSoup(raw_html, 'html.parser')`
    Returns: 
        the author bio/description as a string if it exists, and None otherwise
    """
    bios = parsed_html.find_all('div', class_="ui-caption ui-xs-clamp2 postMetaInline")
    # If bios is empty, that means there is no author bio for article, and the [0]  will error so we need to explicityly
    # check and return None if no author bio
    return bios[0].text if bios else None

def get_title(parsed_html):
    """Parses the title of a Medium article.
    Arguments:
        parsed_html: object returned by calling `BeautifulSoup(raw_html, 'html.parser')`
    Returns: 
        the title of the article as a string or None if title tag not present
    """
    title = parsed_html.find('meta', property='og:title')
    return title['content'] if title else None

def get_raw_publish_date(parsed_html):
    """Parses the date a Medium article was published. 
    Arguments:
        parsed_html: object returned by calling `BeautifulSoup(raw_html, 'html.parser')`
    Returns: 
        a raw/uncleaned publish date, which looks like '2016-11-19T16:48:30.365Z', or None if tag not present
    """
    date = parsed_html.find('meta', property='article:published_time')
    return date['content'] if date else None

def get_article_publisher(parsed_html):
    """Parses the article's publisher from a Medium article. 
    Arguments:
        parsed_html: object returned by calling `BeautifulSoup(raw_html, 'html.parser')`
    Notes:
        The publisher is encoded as "https://facebook.com/publisher" so I extract just the publisher name.
        Not all articles have a verified publisher - like if it's just the author's personal blog - so publisher is
        just "medium" in that case
    Returns:
        If article is hosted on verified publisher, returns publisher name as string
        If article is on personal blog, returns "medium" as a string. 
        If publisher tag doesn't exist, returns None
        
    """
    long_publisher = parsed_html.find("meta", property='article:publisher')
    return long_publisher['content'].split("/")[3] if long_publisher else None

def get_raw_article_text(parsed_html):
    """Extracts out the text/content of the Medium article.
    Arguments:
        parsed_html: object returned by calling `BeautifulSoup(raw_html, 'html.parser')`
    Returns:
        a raw/uncleaned string of text or None if tag is not present.
        The string is considered "raw" because there are some weird characters that are remants
        of header formatting and the like.
    """
    text = parsed_html.find_all('div', class_='postArticle-content')
    return text[0].text if text else None

def clean_text(text):
    """takes a string of text
    removes \xa0 and \u200a that are randomly splattered throughout text
    adds spaces after punctuation that appeared to be missing
    splits all words where a capital letter is in the middle of a word and puts a space in front of it and removes double spaces
    """
    if text:
        cleaned_text = text.replace("\xa0", " ").replace("\u200a", " ")
        cleaned_text = re.sub(r'(?<=[.,])(?=[^\s])', r' ', cleaned_text)
        cleaned_text = re.sub(r'([A-Z])', r' \1', cleaned_text).lstrip().replace("  ", " ")
        return cleaned_text
    else:
        # text is a None object which you can't use regex on.
        return text

def clean_date(date):
    """takes a string in RFC 339 format ('Y-M-D"T"H:M:S.MS"Z"')
    returns a string of format ('Y-M-D H:M:S')
    """
    if date:
        date = dateutil.parser.parse(date)
        date = date.strftime("%Y-%m-%d %H:%M:%S")
        return date
    else:
        # Date is a None object and you can't parse it
        return date
    
def get_tags(parsed_html):
    script_html = parsed_html.find("script", type='application/ld+json')
    script_data = json.loads(script_html.text)
    keywords_list = script_data['keywords']
    tags = []
    for elem in keywords_list:
        if elem.startswith("Tag:"):
            tags.append(elem[4:])
    return tags

def get_url(parsed_html):
    url = parsed_html.find("meta", property="al:web:url")
    return url["content"]


def get_all_article_info(article_url):
    """Parses a Medium article to get all needed information about author and story.
    Arguments:
        article_url: String of url for article to parse
    Returns:
        list of [author, author_bio, title, date, publisher article_text], where each component is a string or None
    """
    raw_html = simple_get_article(article_url)
    parsed_html = BeautifulSoup(raw_html, 'html.parser')
    author = get_author(parsed_html)
    author_bio = get_author_bio(parsed_html)
    title = get_title(parsed_html)
    raw_date = get_raw_publish_date(parsed_html)
    cleaned_date = clean_date(raw_date)
    publisher = get_article_publisher(parsed_html)
    raw_text = get_raw_article_text(parsed_html)
    cleaned_text = clean_text(raw_text)
    tags = get_tags(parsed_html)
    url = get_url(parsed_html)
    return [author, author_bio, title, cleaned_date, publisher, cleaned_text, tags, url]

In [4]:
## working on getting the tags of the article
# article_url = "https://medium.com/launch-school/were-not-a-bootcamp-c33901412c38"
# raw_html = simple_get_article(article_url)
# parsed_html = BeautifulSoup(raw_html, 'html.parser')
# # print(parsed_html)
# script_html = parsed_html.find("script", type='application/ld+json')
# long_publisher = parsed_html.find("meta", property='article:publisher')['content']
# # print(long_publisher)
# ## extract content of script https://stackoverflow.com/questions/26192727/extract-content-of-script-with-beautifulsoup
# script_data = json.loads(script_html.text)
# keywords_list = script_data['keywords']
# tags = []
# for elem in keywords_list:
#     if elem.startswith("Tag:"):
#         tags.append(elem[4:])
# # print(tags)

# def get_tags(parsed_html):
#     script_html = parsed_html.find("script", type='application/ld+json')
#     script_data = json.loads(script_html.text)
#     keywords_list = script_data['keywords']
#     tags = []
#     for elem in keywords_list:
#         if elem.startswith("Tag:"):
#             tags.append(elem[4:])
#     return tags

# get_tags(parsed_html)

In [5]:
## working on getting the url
# article_url = "https://medium.com/launch-school/were-not-a-bootcamp-c33901412c38"
# raw_html = simple_get_article(article_url)
# parsed_html = BeautifulSoup(raw_html, 'html.parser')
# def get_url(parsed_html):
#     url = parsed_html.find("meta", property="al:web:url")
#     return url["content"]
# get_url(parsed_html)


### Single Article Example Usage

In [6]:
test_url_1 = "https://medium.com/launch-school/were-not-a-bootcamp-c33901412c38"
get_all_article_info(test_url_1)

['Chris Lee',
 None,
 "We're Not a Bootcamp – Launch School – Medium",
 '2018-08-01 01:48:19',
 'medium',
 'We’re Not a Bootcamp We’re Something Unique, and Uniquely Effective Photo by Kyle Johnson on Unsplash One of the things about operating in a crowded marketplace is that you tend to get lumped in with the biggest names and most common stereotypes. In the programming education space, that means the now familiar label of “bootcamp. ” I often see people refer to Launch School as a “coding bootcamp, ” which may seem like a reasonable shortcut for helping people understand what we do; however, it fails to capture what makes us special. When I talk about Launch School and what we are trying to achieve, I don’t use the word “bootcamp. ” We are an online school for developers, but more than that, we are a school with an opinionated pedagogy that focuses on fundamentals first with the goal of building skills that last a career. So, why do I steer clear of the “bootcamp” label? To put it si

### Get all open post links from a page

In [7]:
def get_all_open_post_links(parsed_html):
    """Retrieves all links on a page that open to an article (does not return "home", "sign up" links etc).
    Arguments:
        parsed_html: object returned by calling `BeautifulSoup(raw_html, 'html.parser')`
    Notes:
        All links on a page that have the data-attribute 'open-post' are the types of links we want.
        find_all returns a special BeautifulSoup object so I need to extract the string url.
        I think there are muliple links for each post (like the title and "read more"), so returning array 
        has duplicates
    Returns:
        List of link strings
    """
    def href_open_post_data_action(tag):
        """Helper parsing function that checks to see if an "a" tag is an href with an 'open-post' data action.
        Arguments:
            tag: All html tags like <a>, <div> that BeautifulSoup has extracted
        Returns:
            true if tag has href and 'open-post' data-action attribute, false otherwise
        """
        return tag.has_attr('href') and tag.get('data-action') == 'open-post'
    
    link_objects =  parsed_html.find_all(href_open_post_data_action)
    return [link.get('href') for link in link_objects] #retrieves the string url from link object

The above function is the "middle level" function in the grand scheme of our parsing.  
First, we defined the function that parses information for a single article.  
The next step is to retrieve all the links to articles on a web page (like the web page returned by searching "coding bootcamp" for example), and then call our single article parsing function on each link we extracted from the page.  
Note - the name is "open post" links because there are a ton of links on a webpage - to "home", to "sign-up", to "search" etc. We are only interested in extracting the links that correspond to articles about coding bootcamps.  
After inspecting the source html, I discovered that all links that open to coding bootcamp articles have the attribute "data-action" set to "open-post" within the html `<a href>` tag. So this function ensures that it only extracts the links we want. 

## Iterate Through All Dates in Archive

The next step is to define a function that will iterate through each "date" page in the archive, gather all the open post links for each "date" page (i.e. all the story links), and then call the single article parser on each link.  
**NOTE**: The following function is currently defined such that it will *ONLY* work on the archive for the `Codingbootcamp` (https://medium.com/tag/codingbootcamp/archive) because currently it operates on the assumption that the years 2013 and 2014 have so few articles that they are not seperated by month or day, only year. 

In [8]:
def format_dates_for_url(integer_date):
    """
    Formats dates to enter into the archive url, which requires numbers < 10 to be encoded with a "0" in front,
    but Python doesn't allow 01 as an integer. 
    """
    if integer_date < 10:
        return str(0) + str(integer_date)
    else:
        return str(integer_date)
def get_specific_Codingbootcamp_links():
    # For current range of years available
    years = range(2013, 2019)
    months_w_o_zero_in_front = range(1, 13)
    days_w_o_zero = range(1, 32)
    base_url = "https://medium.com/tag/codingbootcamp/archive"
    # Collector variable to story all story urls
    story_links = []
    for year in years:
        # These years don't have month, day subdivisions so to prevent redundant parsing, 
        # just parse the base year and its links.
        if year == 2013 or year == 2014:
            # 
            url = base_url + "/" + format_dates_for_url(year)
            raw_year_html = simple_get_archive(url)
            # If raw_year_html is None, can't parse it so skip
            if not raw_year_html:
                break
            parsed_year_html = BeautifulSoup(raw_year_html, 'html.parser')
            links = get_all_open_post_links(parsed_year_html)
            #use extend instead of append because just want to add elements, not create nested lists.
            story_links.extend(links) 
        else:
            for month in months_w_o_zero_in_front:
                for day in days_w_o_zero:
                    url = base_url + "/" + format_dates_for_url(year) + format_dates_for_url(month) + "/" + format_dates_for_url(day)
                    raw_day_html = simple_get_archive(url)
                    # 2015, 2016, 2017, 2018 have some dates with no stories, so GET requests are redirected
                    # and we don't want to parse stuff we already did.
                    if raw_day_html == "redirect":
                        #skip the redirected day - advance in for loop
                        break
                    # If raw_day_html is None, can't parse it so skip
                    elif not raw_day_html:
                        break
                    parsed_day_html = BeautifulSoup(raw_day_html, 'html.parser')
                    links = get_all_open_post_links(parsed_day_html)
                    #use extend instead of append because just want to add elements, not create nested lists.
                    story_links.extend(links)
    # Create a set of links to remove duplicates and then turn back into a list (better data structure)
    return list(set(story_links))

### Collect Text Data

In [9]:
# List of urls for all stories tagged with `Codingbootcamp`
story_links = get_specific_Codingbootcamp_links()

In [10]:
# Get all info like article, publisher, etc for each url gathered above
# THIS CELL TAKES FOREVER BE WARNED - DO NOT CLOSE COMPUTER WHILE RUNNING ELSE THE CONNECTION BREAKS
story_info_list = []
for url in story_links:
    story_info_list.append(get_all_article_info(url))

In [11]:
column_names = ["author", "author_bio", "title", "date", "publisher", "text", "tags", "url"]
Codingbootcamp_info = pd.DataFrame(story_info_list, columns = column_names)

In [12]:
Codingbootcamp_info

Unnamed: 0,author,author_bio,title,date,publisher,text,tags,url
0,Kalen Hammann,,"A New Beginning, Part 22 – Kalen Hammann – Medium",2017-08-07 12:51:18,medium,"A New Beginning, Part 22 W E E K 22: Already?!...","[Web Development, Codingbootcamp, Code Review,...",https://medium.com/@kalen7/a-surprising-career...
1,Mike Brave,I make things with design and code. Here's to ...,Why I’m Taking a Chance on 42 – Mike Brave – M...,2018-08-06 07:04:49,medium,Why I’m Taking a Chance on 42 Edit: This is pa...,"[Life Lessons, 42, Codingbootcamp, Ecole 42]",https://medium.com/@themichaelbrave/why-im-tak...
2,Irsan Sebastian,,Let’s start a journey ! – Irsan Sebastian – Me...,2016-11-05 14:19:30,medium,"Let’s start a journey ! Nama gue Irsan, gue se...","[Story, My Life, Codingbootcamp, Coding, Startup]",https://medium.com/@irsansebastian/lets-start-...
3,KeepCoding,We create the best learning experience for Ful...,KeepCoding Bootcamp KickOff Event – KeepCoding...,2017-05-16 12:38:42,medium,Keep Coding Bootcamp Kick Off Event2017 Mobile...,"[Programming, Codingbootcamp, Web Development,...",https://medium.com/@KeepCoding_/keepcoding-boo...
4,Sabio Coding Bootcamp,Lead by the most senior coding bootcamp staff ...,Coding Bootcamp Student Tells All – Sabio Codi...,2018-09-14 07:33:30,wesabio,Coding Bootcamp Student Tells All Sabio coding...,"[Veterans In Tech, Codingbootcamp]",https://blog.sabio.la/coding-bootcamp-student-...
5,Mike Brave,I make things with design and code. Here's to ...,42 Piscine Day 11 — (08) – Mike Brave – Medium,2018-08-31 04:33:26,medium,42 Piscine Day 11 — (08) Edit: This is part of...,"[Life Lessons, Ecole 42, Education, Codingboot...",https://medium.com/@themichaelbrave/42-piscine...
6,Better Developer,Sharing my experience in what it takes to beco...,Insight to Help With the Role You Want to Have...,2017-09-04 20:47:04,medium,Insight to Help With the Role You Want to Have...,"[Software Development, Learning To Code, Learn...",https://medium.com/@better_developer/insight-t...
7,adam tropp,,CS 100.3: Linked Lists – adam tropp – Medium,2018-05-16 21:39:10,medium,"C S 100. 3: Linked Lists Without further ado, ...","[Programming, Codingbootcamp, Computer Science]",https://medium.com/@adt6261/cs-100-3-linked-li...
8,Verity Honebon,,Good Code Karma – Verity Honebon – Medium,2017-07-17 23:32:17,medium,Good Code Karma When I was a kid I believed in...,"[Programming, Coding, Codingbootcamp, Collabor...",https://medium.com/@verityhonebon/good-code-ka...
9,Manchester Codes,https://mcr.codes,An evening at Manchester Codes — What to expec...,2018-03-19 18:36:41,medium,An evening at Manchester Codes — What to expec...,"[Programming, Manchester Codes, Manchester, Co...",https://medium.com/@MCRcodes/an-evening-at-man...


In [13]:
#Convert to csv so I don't have to run the collection function again - it takes FOREVER
Codingbootcamp_info.to_csv("codingbootcamp_articles_info.csv")

In [14]:
#see frequency of different tags
Codingbootcamp_info["tags"].value_counts()

TypeError: unhashable type: 'list'

We cannot get the value_counts of a list object so we need to get a little creative in order to get the value_counts of the elements in the list (tag column).

In [15]:
newdf = Codingbootcamp_info["tags"].apply(pd.Series).stack().reset_index(level=1, drop=True).to_frame('tags')
newdf["tags"].value_counts()

Codingbootcamp          1002
Coding                   476
Programming              412
Web Development          207
JavaScript               177
Learning To Code         101
Education                 90
Women In Tech             74
Software Development      60
Tech                      49
Technology                49
Ruby                      48
Coding Bootcamps          42
The Iron Yard             41
Startup                   36
Career Change             35
Ruby on Rails             35
Internships               34
Bootcamp                  32
Freecodecamp              31
Learn To Code             30
Coding Bootcamp           29
Ecole 42                  29
Code Newbie               28
Code                      27
React                     26
Chingu                    24
Life Lessons              24
Women Who Code            22
Computer Science          21
                        ... 
Meaningful Work            1
Girls                      1
Adversity                  1
Frustration   

I will play around with this more later, but for now let's look at gender of the names! We need more data than just the author name, in order to tell if they are male or female, so I am going to import the babynames data from the SSN website which may have information that is useful.

In [16]:
import urllib.request
import os.path

data_url = "https://www.ssa.gov/oact/babynames/names.zip"
local_filename = "babynames.zip"
if not os.path.exists(local_filename): # if the data exists don't download again
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

The data is organized into separate files in the format `yobYYYY.txt` with each file containing the `name`, `sex`, and `count` of babies registered in that year. Now we load the data directly into Python without unzipping in order to increase efficiency.

In [17]:
import zipfile
babynames = [] 
with zipfile.ZipFile(local_filename, "r") as zf:
    data_files = [f for f in zf.filelist if f.filename[-3:] == "txt"]
    def extract_year_from_filename(fn):
        return int(fn[3:7])
    for f in data_files:
        year = extract_year_from_filename(f.filename)
        with zf.open(f) as fp:
            df = pd.read_csv(fp, names=["Name", "Sex", "Count"])
            df["Year"] = year
            babynames.append(df)
babynames = pd.concat(babynames)

In [18]:
babynames.head()

Unnamed: 0,Name,Sex,Count,Year
0,Mary,F,7065,1880
1,Anna,F,2604,1880
2,Emma,F,2003,1880
3,Elizabeth,F,1939,1880
4,Minnie,F,1746,1880


Yay now we have a babynames dataset to use!

Let's start off my making the names in the babynames dataset and our own data frame, lowercase (for standardization purposes so we can easily find patterns and matches). I'll create a new column for the lowercase name so we don't lose the original data.

In [46]:
babynames["LName"] = babynames["Name"].str.lower()
babynames.head()

Unnamed: 0,Name,Sex,Count,Year,Lowercase Name,prob_female,LName
0,Mary,F,7065,1880,mary,,mary
1,Anna,F,2604,1880,anna,,anna
2,Emma,F,2003,1880,emma,,emma
3,Elizabeth,F,1939,1880,elizabeth,,elizabeth
4,Minnie,F,1746,1880,minnie,,minnie


In [20]:
Codingbootcamp_info["lowercase author"] = Codingbootcamp_info["author"].str.lower()
Codingbootcamp_info.head()

Unnamed: 0,author,author_bio,title,date,publisher,text,tags,url,lowercase author
0,Kalen Hammann,,"A New Beginning, Part 22 – Kalen Hammann – Medium",2017-08-07 12:51:18,medium,"A New Beginning, Part 22 W E E K 22: Already?!...","[Web Development, Codingbootcamp, Code Review,...",https://medium.com/@kalen7/a-surprising-career...,kalen hammann
1,Mike Brave,I make things with design and code. Here's to ...,Why I’m Taking a Chance on 42 – Mike Brave – M...,2018-08-06 07:04:49,medium,Why I’m Taking a Chance on 42 Edit: This is pa...,"[Life Lessons, 42, Codingbootcamp, Ecole 42]",https://medium.com/@themichaelbrave/why-im-tak...,mike brave
2,Irsan Sebastian,,Let’s start a journey ! – Irsan Sebastian – Me...,2016-11-05 14:19:30,medium,"Let’s start a journey ! Nama gue Irsan, gue se...","[Story, My Life, Codingbootcamp, Coding, Startup]",https://medium.com/@irsansebastian/lets-start-...,irsan sebastian
3,KeepCoding,We create the best learning experience for Ful...,KeepCoding Bootcamp KickOff Event – KeepCoding...,2017-05-16 12:38:42,medium,Keep Coding Bootcamp Kick Off Event2017 Mobile...,"[Programming, Codingbootcamp, Web Development,...",https://medium.com/@KeepCoding_/keepcoding-boo...,keepcoding
4,Sabio Coding Bootcamp,Lead by the most senior coding bootcamp staff ...,Coding Bootcamp Student Tells All – Sabio Codi...,2018-09-14 07:33:30,wesabio,Coding Bootcamp Student Tells All Sabio coding...,"[Veterans In Tech, Codingbootcamp]",https://blog.sabio.la/coding-bootcamp-student-...,sabio coding bootcamp


See the total number of babies with each name broken down by sex.

In [47]:
sex_counts = pd.pivot_table(babynames, index='LName', columns='Sex', values='Count',
                            aggfunc='sum', fill_value=0., margins=True)
sex_counts.head()

Sex,F,M,All
LName,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
aaban,0.0,107.0,107.0
aabha,35.0,0.0,35.0
aabid,0.0,10.0,10.0
aabir,0.0,5.0,5.0
aabriella,32.0,0.0,32.0


Calculate the probability a name is female:

In [48]:
prob_female = sex_counts['F'] / sex_counts['All'] 
prob_female.head(10)

LName
aaban        0.000000
aabha        1.000000
aabid        0.000000
aabir        0.000000
aabriella    1.000000
aada         1.000000
aadam        0.000000
aadan        0.000000
aadarsh      0.000000
aaden        0.001073
dtype: float64

In [50]:
prob_female["shalini"]

1.0

Similarly, we can calculate the probability a name is male:

In [51]:
prob_male = sex_counts['M'] / sex_counts['All'] 
prob_male.head(10)

LName
aaban        1.000000
aabha        0.000000
aabid        1.000000
aabir        1.000000
aabriella    0.000000
aada         0.000000
aadam        1.000000
aadan        1.000000
aadarsh      1.000000
aaden        0.998927
dtype: float64

I will define a function to return the most likely `Sex` for a name. If there is an exact tie, the function returns Male. If the name does not appear in the social security dataset, return Unknown.

In [98]:
def sex_from_name(name):
    lower_name = name.lower()
    if lower_name in prob_female.index:
        return 'F' if prob_female[lower_name] > 0.5 else 'M'
    else:
        return "Unknown"

def prob_sex_from_name(name, sex):
#     lower_name = name.lower()
    if sex == "F":
        return prob_female[lower_name]
    elif sex == 'M':
        return prob_male[lower_name]
    else:
        return "Unknown"

In [89]:
shalini_sex = sex_from_name('shalini')
shalini_sex

'F'

In [85]:
sex_from_name('aaden')

'M'

In [90]:
prob_sex_from_name('shalini', shalini_sex)

1.0

Now let me find a way to apply this function to our own dataframe in order to predict the gender of the different authors.
Next steps:
* Need to get the first name of the author
* Filter out any companies/orgs that may be in the author column
* Apply sex_from_name function to names in the bootcamp dataframe
* Have a confidence interval (if possible) or different measure of how sure we are about the given gender
* Create different dataframes with org data and personal data (as publisher/author)

Let's go back to our dataframe and filter out anything that may be companies or organizations

In [91]:
Codingbootcamp_info.head()

Unnamed: 0,author,author_bio,title,date,publisher,text,tags,url,lowercase author
0,Kalen Hammann,,"A New Beginning, Part 22 – Kalen Hammann – Medium",2017-08-07 12:51:18,medium,"A New Beginning, Part 22 W E E K 22: Already?!...","[Web Development, Codingbootcamp, Code Review,...",https://medium.com/@kalen7/a-surprising-career...,kalen hammann
1,Mike Brave,I make things with design and code. Here's to ...,Why I’m Taking a Chance on 42 – Mike Brave – M...,2018-08-06 07:04:49,medium,Why I’m Taking a Chance on 42 Edit: This is pa...,"[Life Lessons, 42, Codingbootcamp, Ecole 42]",https://medium.com/@themichaelbrave/why-im-tak...,mike brave
2,Irsan Sebastian,,Let’s start a journey ! – Irsan Sebastian – Me...,2016-11-05 14:19:30,medium,"Let’s start a journey ! Nama gue Irsan, gue se...","[Story, My Life, Codingbootcamp, Coding, Startup]",https://medium.com/@irsansebastian/lets-start-...,irsan sebastian
3,KeepCoding,We create the best learning experience for Ful...,KeepCoding Bootcamp KickOff Event – KeepCoding...,2017-05-16 12:38:42,medium,Keep Coding Bootcamp Kick Off Event2017 Mobile...,"[Programming, Codingbootcamp, Web Development,...",https://medium.com/@KeepCoding_/keepcoding-boo...,keepcoding
4,Sabio Coding Bootcamp,Lead by the most senior coding bootcamp staff ...,Coding Bootcamp Student Tells All – Sabio Codi...,2018-09-14 07:33:30,wesabio,Coding Bootcamp Student Tells All Sabio coding...,"[Veterans In Tech, Codingbootcamp]",https://blog.sabio.la/coding-bootcamp-student-...,sabio coding bootcamp


A list of the things to filter out: when publisher is wesabio, code.likeagirl.io, makersacademy, hackernoon, fuerzamuktek, codeburst, ubiqum, fundapps, itnext.io, propulsioncodingacademy, PrototyprIO.

In [92]:
orgs = ["wesabio", "code.likeagirl.io", "makersacademy", "hackernoon", "fuerzamuktek", "codeburst", "ubiqum", "fundapps", "itnext.io", "propulsioncodingacademy", "PrototyprIO"]
new_codingdf = Codingbootcamp_info[~Codingbootcamp_info["publisher"].isin(orgs)]
orgs_df = Codingbootcamp_info[Codingbootcamp_info["publisher"].isin(orgs)]
new_codingdf

Unnamed: 0,author,author_bio,title,date,publisher,text,tags,url,lowercase author
0,Kalen Hammann,,"A New Beginning, Part 22 – Kalen Hammann – Medium",2017-08-07 12:51:18,medium,"A New Beginning, Part 22 W E E K 22: Already?!...","[Web Development, Codingbootcamp, Code Review,...",https://medium.com/@kalen7/a-surprising-career...,kalen hammann
1,Mike Brave,I make things with design and code. Here's to ...,Why I’m Taking a Chance on 42 – Mike Brave – M...,2018-08-06 07:04:49,medium,Why I’m Taking a Chance on 42 Edit: This is pa...,"[Life Lessons, 42, Codingbootcamp, Ecole 42]",https://medium.com/@themichaelbrave/why-im-tak...,mike brave
2,Irsan Sebastian,,Let’s start a journey ! – Irsan Sebastian – Me...,2016-11-05 14:19:30,medium,"Let’s start a journey ! Nama gue Irsan, gue se...","[Story, My Life, Codingbootcamp, Coding, Startup]",https://medium.com/@irsansebastian/lets-start-...,irsan sebastian
3,KeepCoding,We create the best learning experience for Ful...,KeepCoding Bootcamp KickOff Event – KeepCoding...,2017-05-16 12:38:42,medium,Keep Coding Bootcamp Kick Off Event2017 Mobile...,"[Programming, Codingbootcamp, Web Development,...",https://medium.com/@KeepCoding_/keepcoding-boo...,keepcoding
5,Mike Brave,I make things with design and code. Here's to ...,42 Piscine Day 11 — (08) – Mike Brave – Medium,2018-08-31 04:33:26,medium,42 Piscine Day 11 — (08) Edit: This is part of...,"[Life Lessons, Ecole 42, Education, Codingboot...",https://medium.com/@themichaelbrave/42-piscine...,mike brave
6,Better Developer,Sharing my experience in what it takes to beco...,Insight to Help With the Role You Want to Have...,2017-09-04 20:47:04,medium,Insight to Help With the Role You Want to Have...,"[Software Development, Learning To Code, Learn...",https://medium.com/@better_developer/insight-t...,better developer
7,adam tropp,,CS 100.3: Linked Lists – adam tropp – Medium,2018-05-16 21:39:10,medium,"C S 100. 3: Linked Lists Without further ado, ...","[Programming, Codingbootcamp, Computer Science]",https://medium.com/@adt6261/cs-100-3-linked-li...,adam tropp
8,Verity Honebon,,Good Code Karma – Verity Honebon – Medium,2017-07-17 23:32:17,medium,Good Code Karma When I was a kid I believed in...,"[Programming, Coding, Codingbootcamp, Collabor...",https://medium.com/@verityhonebon/good-code-ka...,verity honebon
9,Manchester Codes,https://mcr.codes,An evening at Manchester Codes — What to expec...,2018-03-19 18:36:41,medium,An evening at Manchester Codes — What to expec...,"[Programming, Manchester Codes, Manchester, Co...",https://medium.com/@MCRcodes/an-evening-at-man...,manchester codes
10,Code Collective,Exposing the world to code. Questions/Feedback...,Courses: – Code Collective – Medium,2018-09-01 18:52:20,medium,Courses: Stanford C S193 X Web Dev Course Syll...,"[Programming, Coding, Codingbootcamp, Code, Ed...",https://medium.com/@TexasCode/courses-76337619...,code collective


From a manual look at the filtered dataframe, I am also going to filter out rows that have author KeepCoding, Better Developer, Manchester Codes, Code Collective, Accelerate Tech, Rithm School.

In [93]:
more_orgs = ["KeepCoding", "Better Developer", "Manchester Codes", "Code Collective", "Accelerate Tech", "Rithm School"]
new_codingdf = new_codingdf[~new_codingdf["author"].isin(more_orgs)]
new_codingdf

Unnamed: 0,author,author_bio,title,date,publisher,text,tags,url,lowercase author
0,Kalen Hammann,,"A New Beginning, Part 22 – Kalen Hammann – Medium",2017-08-07 12:51:18,medium,"A New Beginning, Part 22 W E E K 22: Already?!...","[Web Development, Codingbootcamp, Code Review,...",https://medium.com/@kalen7/a-surprising-career...,kalen hammann
1,Mike Brave,I make things with design and code. Here's to ...,Why I’m Taking a Chance on 42 – Mike Brave – M...,2018-08-06 07:04:49,medium,Why I’m Taking a Chance on 42 Edit: This is pa...,"[Life Lessons, 42, Codingbootcamp, Ecole 42]",https://medium.com/@themichaelbrave/why-im-tak...,mike brave
2,Irsan Sebastian,,Let’s start a journey ! – Irsan Sebastian – Me...,2016-11-05 14:19:30,medium,"Let’s start a journey ! Nama gue Irsan, gue se...","[Story, My Life, Codingbootcamp, Coding, Startup]",https://medium.com/@irsansebastian/lets-start-...,irsan sebastian
5,Mike Brave,I make things with design and code. Here's to ...,42 Piscine Day 11 — (08) – Mike Brave – Medium,2018-08-31 04:33:26,medium,42 Piscine Day 11 — (08) Edit: This is part of...,"[Life Lessons, Ecole 42, Education, Codingboot...",https://medium.com/@themichaelbrave/42-piscine...,mike brave
7,adam tropp,,CS 100.3: Linked Lists – adam tropp – Medium,2018-05-16 21:39:10,medium,"C S 100. 3: Linked Lists Without further ado, ...","[Programming, Codingbootcamp, Computer Science]",https://medium.com/@adt6261/cs-100-3-linked-li...,adam tropp
8,Verity Honebon,,Good Code Karma – Verity Honebon – Medium,2017-07-17 23:32:17,medium,Good Code Karma When I was a kid I believed in...,"[Programming, Coding, Codingbootcamp, Collabor...",https://medium.com/@verityhonebon/good-code-ka...,verity honebon
12,Ari Kramer,,Taking the leap into coding was the best decis...,2018-05-06 14:38:55,medium,Taking the leap into coding was the best decis...,"[Coding, Software Development, Codingbootcamp]",https://medium.com/@arikramer24/taking-the-lea...,ari kramer
13,Erin Levine,Pretending to be an adult. Discussin' the ebb ...,Week 7: Express and Handlebars – Erin Levine –...,2017-07-02 14:26:06,medium,Week 7: Express and Handlebars You ever sit do...,"[JavaScript, Web Development, Coding, Codingbo...",https://medium.com/@erinlevine_48138/week-7-ex...,erin levine
14,conshus,the Black MacGyver | @OURshow (Sat 5-7pm WPRK ...,Day 38 — Rewind – conshus – Medium,2017-05-03 10:45:00,medium,Day 38 — Rewindliner notes: Yesterday turned o...,"[Reactjs, Firebase, Coding, Codingbootcamp, Hi...",https://medium.com/@conshus/day-38-rewind-7511...,conshus
15,Mike Brave,I make things with design and code. Here's to ...,42 Piscine Day 23 – Mike Brave – Medium,2018-09-12 02:20:49,medium,42 Piscine Day 23 Edit: This is part of a seri...,"[Art, Ecole 42, Code, Education, Codingbootcamp]",https://medium.com/@themichaelbrave/42-piscine...,mike brave


Now I need to get the first name of the author in "lowercase author" in order to put it through the gender function.

In [94]:
new_codingdf["first name"] = new_codingdf["lowercase author"].str.split(' ').str.get(0)
new_codingdf.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0,author,author_bio,title,date,publisher,text,tags,url,lowercase author,first name
0,Kalen Hammann,,"A New Beginning, Part 22 – Kalen Hammann – Medium",2017-08-07 12:51:18,medium,"A New Beginning, Part 22 W E E K 22: Already?!...","[Web Development, Codingbootcamp, Code Review,...",https://medium.com/@kalen7/a-surprising-career...,kalen hammann,kalen
1,Mike Brave,I make things with design and code. Here's to ...,Why I’m Taking a Chance on 42 – Mike Brave – M...,2018-08-06 07:04:49,medium,Why I’m Taking a Chance on 42 Edit: This is pa...,"[Life Lessons, 42, Codingbootcamp, Ecole 42]",https://medium.com/@themichaelbrave/why-im-tak...,mike brave,mike
2,Irsan Sebastian,,Let’s start a journey ! – Irsan Sebastian – Me...,2016-11-05 14:19:30,medium,"Let’s start a journey ! Nama gue Irsan, gue se...","[Story, My Life, Codingbootcamp, Coding, Startup]",https://medium.com/@irsansebastian/lets-start-...,irsan sebastian,irsan
5,Mike Brave,I make things with design and code. Here's to ...,42 Piscine Day 11 — (08) – Mike Brave – Medium,2018-08-31 04:33:26,medium,42 Piscine Day 11 — (08) Edit: This is part of...,"[Life Lessons, Ecole 42, Education, Codingboot...",https://medium.com/@themichaelbrave/42-piscine...,mike brave,mike
7,adam tropp,,CS 100.3: Linked Lists – adam tropp – Medium,2018-05-16 21:39:10,medium,"C S 100. 3: Linked Lists Without further ado, ...","[Programming, Codingbootcamp, Computer Science]",https://medium.com/@adt6261/cs-100-3-linked-li...,adam tropp,adam


Now put it through the gender function

In [95]:
new_codingdf["gender"] = new_codingdf["first name"].apply(sex_from_name)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [96]:
new_codingdf.head()

Unnamed: 0,author,author_bio,title,date,publisher,text,tags,url,lowercase author,first name,gender
0,Kalen Hammann,,"A New Beginning, Part 22 – Kalen Hammann – Medium",2017-08-07 12:51:18,medium,"A New Beginning, Part 22 W E E K 22: Already?!...","[Web Development, Codingbootcamp, Code Review,...",https://medium.com/@kalen7/a-surprising-career...,kalen hammann,kalen,M
1,Mike Brave,I make things with design and code. Here's to ...,Why I’m Taking a Chance on 42 – Mike Brave – M...,2018-08-06 07:04:49,medium,Why I’m Taking a Chance on 42 Edit: This is pa...,"[Life Lessons, 42, Codingbootcamp, Ecole 42]",https://medium.com/@themichaelbrave/why-im-tak...,mike brave,mike,M
2,Irsan Sebastian,,Let’s start a journey ! – Irsan Sebastian – Me...,2016-11-05 14:19:30,medium,"Let’s start a journey ! Nama gue Irsan, gue se...","[Story, My Life, Codingbootcamp, Coding, Startup]",https://medium.com/@irsansebastian/lets-start-...,irsan sebastian,irsan,Unknown
5,Mike Brave,I make things with design and code. Here's to ...,42 Piscine Day 11 — (08) – Mike Brave – Medium,2018-08-31 04:33:26,medium,42 Piscine Day 11 — (08) Edit: This is part of...,"[Life Lessons, Ecole 42, Education, Codingboot...",https://medium.com/@themichaelbrave/42-piscine...,mike brave,mike,M
7,adam tropp,,CS 100.3: Linked Lists – adam tropp – Medium,2018-05-16 21:39:10,medium,"C S 100. 3: Linked Lists Without further ado, ...","[Programming, Codingbootcamp, Computer Science]",https://medium.com/@adt6261/cs-100-3-linked-li...,adam tropp,adam,M


In [77]:
new_codingdf["gender"].value_counts()

M          499
F          232
Unknown    226
Name: gender, dtype: int64

We have quite a few unknowns in our dataset (almost one fourth, so let's just take a look at these values)

In [78]:
new_codingdf[new_codingdf["gender"] == "Unknown"]

Unnamed: 0,author,author_bio,title,date,publisher,text,tags,url,lowercase author,first name,gender
2,Irsan Sebastian,,Let’s start a journey ! – Irsan Sebastian – Me...,2016-11-05 14:19:30,medium,"Let’s start a journey ! Nama gue Irsan, gue se...","[Story, My Life, Codingbootcamp, Coding, Startup]",https://medium.com/@irsansebastian/lets-start-...,irsan sebastian,irsan,Unknown
14,conshus,the Black MacGyver | @OURshow (Sat 5-7pm WPRK ...,Day 38 — Rewind – conshus – Medium,2017-05-03 10:45:00,medium,Day 38 — Rewindliner notes: Yesterday turned o...,"[Reactjs, Firebase, Coding, Codingbootcamp, Hi...",https://medium.com/@conshus/day-38-rewind-7511...,conshus,conshus,Unknown
23,이상훈,,"CodeStates_immersive 3기 2–2 : call, apply, bin...",2017-01-03 15:47:34,medium,"Code States_immersive 3기 2–2 : call, apply, bi...","[Codingbootcamp, JavaScript, Function, Front E...",https://medium.com/@fkdndpf1/codestates-immers...,이상훈,이상훈,Unknown
28,Leibel Hecht,I’m Leibel. I’m transitioning from Talmud to T...,Weather Mood Booster App. Yay! I made an app 🤓...,2017-06-04 22:12:18,medium,Weather Mood Booster App. Yay! I made an app 🤓...,"[Data Science, Data, Codingbootcamp, Coding, W...",https://medium.com/my-coding-dojo-experience/w...,leibel hecht,leibel,Unknown
34,D Saunders,Excited about Big data + education + economy,Galvanize Weeks 3 & 4 of the Data Science Imme...,2018-10-04 03:00:33,medium,"It’s 6:15pm, we’ve been here since 8:30, and t...","[Machine Learning, Codingbootcamp, Education, ...",https://medium.com/@dmariesaunders/galvanize-w...,d saunders,d,Unknown
43,Develop Me,We are a talent and digital skills accelerator...,Meet the Instructor: My Life in Code – Develop...,2018-05-09 09:26:02,medium,Meet the Instructor: My Life in Code Mark Wale...,"[Programming, Meet The Instructor, Coding, Tec...",https://medium.com/@develop_me_uk/meet-the-ins...,develop me,develop,Unknown
46,Altcademy Team,The team behind Altcademy.com,Student Stories: Christl Tiu – Altcademy – Medium,2018-08-20 03:00:30,medium,Student Stories: Christl Tiu Christl is curren...,"[Programming, Student Stories, Learning To Cod...",https://medium.com/altcademy/student-stories-c...,altcademy team,altcademy,Unknown
47,conshus,the Black MacGyver | @OURshow (Sat 5-7pm WPRK ...,Day 47 — First Things First – conshus – Medium,2017-05-16 11:14:48,medium,Day 47 — First Things Firstliner notes: I figu...,"[JavaScript, Materialize, React, Hip Hop, Codi...",https://medium.com/@conshus/day-47-first-thing...,conshus,conshus,Unknown
49,Ilsmarie Presilia,25-year-old autodidact that likes to ponder. D...,Tips on how to survive bootcamp induced burnou...,2017-08-04 17:13:17,medium,Tips on how to survive bootcamp induced burnou...,"[Life Lessons, Failure, Codingbootcamp, Depres...",https://medium.com/@ipresilia/tips-on-how-to-s...,ilsmarie presilia,ilsmarie,Unknown
51,Horizons,,"Horizons, Chapter 7 – Horizons School of Techn...",2016-08-04 20:19:28,medium,"Horizons, Chapter 7 At the beginning of Week 7...","[Entrepreneurship, Startup, Technology, Coding...",https://medium.com/horizons-education/horizons...,horizons,horizons,Unknown


We can see that many of the authors in this part are different schools/workshops/organizations that we missed out on earlier, or they are more international type of names, or they are names with a letter for the first name followed by the full last name (these we may have to go through by hand in order to determine gender). Many are also from an author called "conshus"