### GitHub README Extraction 

This summer, the DSGP OSS 2021 team will be classifying GitHub repositories into different software types. To do this, we will extract README files from all of the OSS repos (i.e., those with OSS licenses) and then develop NLP techniques to classify those repos. In this file, we document the extraction process for GitHub README files.

First, we load our packages.

In [1]:
# load packages 
import os
import psycopg2 as pg
from sqlalchemy import create_engine
import pandas as pd
import requests as r
import string 
import json
import base64
import urllib.request
import itertools 
from bs4 import BeautifulSoup
from dotenv import load_dotenv
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

myPath = '/sfs/qumulo/qhome/kb7hp/git/oss-2020/src/09_repository-scraping/'; print(myPath)

/sfs/qumulo/qhome/kb7hp/git/oss-2020/src/09_repository-scraping/


Next, we will grab our data from the database. 

In [2]:
# connect to the database, download data 
connection = pg.connect(host = 'postgis1', database = 'sdad', 
                        user = os.environ.get('db_user'), 
                        password = os.environ.get('db_pwd'))

slug_query = '''SELECT slug FROM gh.repos LIMIT 100'''

# run the query, convert the result to a dataframe, and preview the first rows
raw_slug_data = pd.read_sql_query(slug_query, con=connection)
raw_slug_data.head()

Unnamed: 0,slug
0,zz44-b/pkg
1,zz44-b/tools
2,tenkjm/hw-8
3,dasfoo/rover
4,dashacker/dnd5e


In [3]:
slugs = ["brandonleekramer/diversity", "uva-bi-sdad/oss-2020", "facebook/react"] #test data 
#slugs = raw_slug_data.slug.tolist()

for slug in slugs:
    # download the rendered README page (HTML) from the master branch
    url = f'https://github.com/{slug}/blob/master/README.md'
    login, repo = slug.split("/")
    # save next to the notebook as readme_<login>_<repo>.txt
    fullfilename = os.path.join(myPath, f'readme_{login}_{repo}.txt')
    urllib.request.urlretrieve(url, fullfilename)
    print(f'Finished scraping: {login}/{repo}')

Finished scraping: brandonleekramer/diversity
Finished scraping: uva-bi-sdad/oss-2020
Finished scraping: facebook/react


Note to Crystal: 

The loop above this note works for small-scale scraping, but we need to rate-limit our requests before we scale up. 
https://stackoverflow.com/questions/40748687/python-api-rate-limiting-how-to-limit-api-calls-globally
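
One way to add the rate limit might look like the sketch below. It spaces requests evenly so that no more than `max_calls` go out per `period` seconds. The names `RateLimiter` and `scrape_with_limit` are hypothetical, `fetch` stands in for whatever download function we end up using, and the limit itself (1 request per second in the usage comment) is a placeholder, since we have not yet settled on the real limit.

```python
import time


class RateLimiter:
    """Space calls evenly: at most `max_calls` calls per `period` seconds."""

    def __init__(self, max_calls, period, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = period / max_calls
        self._clock = clock      # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._last_call = None   # time of the most recent call, None before the first

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


def scrape_with_limit(slugs, fetch, limiter):
    """Call fetch(slug) for each slug, pausing between calls as needed."""
    for slug in slugs:
        limiter.wait()
        fetch(slug)


# Hypothetical usage with a placeholder limit of 1 request per second:
# scrape_with_limit(slugs, download_readme, RateLimiter(max_calls=1, period=1.0))
```

This only throttles a single process. If we later run several workers, they would need to share one limiter (or each take a fraction of the budget).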

We could also try to add in multiprocessing to speed things up. I'm not sure this link is the right one, but we can chat more about that.
https://stackoverflow.com/questions/54858979/how-to-use-multiprocessing-with-requests-module
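
Since downloading is mostly waiting on the network, a thread pool may be enough here and is simpler than true multiprocessing. The sketch below swaps in `concurrent.futures.ThreadPoolExecutor` from the standard library rather than the `multiprocessing` module. `scrape_many` is a hypothetical name, `fetch` is again a stand-in for our download function, and `max_workers=4` is an arbitrary starting point.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_many(slugs, fetch, max_workers=4):
    """Run fetch(slug) for every slug on a thread pool.

    Returns a dict mapping each slug to fetch's return value, or to the
    exception it raised, so one failed repo does not stop the whole run.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map each future back to its slug so we know which repo it belongs to
        futures = {pool.submit(fetch, slug): slug for slug in slugs}
        for future in as_completed(futures):
            slug = futures[future]
            try:
                results[slug] = future.result()
            except Exception as exc:  # record the failure and keep going
                results[slug] = exc
    return results
```

Running requests in parallel makes it easier to hit a rate limit, so this would need to be combined with whatever throttling we settle on.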

In [4]:
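repo_name = []
readme_text = []
for filename in os.listdir(myPath):
    # only parse the files written by the scraping loop
    if filename.startswith('readme_') and filename.endswith('.txt'):
        with open(os.path.join(myPath, filename)) as f:
            content = f.read()
        soup = BeautifulSoup(content, 'html.parser')
        # the rendered README sits inside the page's <article> element
        clean_html = ''.join(soup.article.findAll(text=True))
        repo_name.append(filename)
        readme_text.append(clean_html)

# build the dataframe once, after all files have been parsed
df = pd.DataFrame({'slug': repo_name, 'readme_text': readme_text}, columns=["slug", "readme_text"])
# strip the 'readme_' prefix and the '.txt' suffix from the filename
df['slug'] = df['slug'].str[len('readme_'):-len('.txt')]
# GitHub logins can't contain underscores but repo names can,
# so only the first underscore separates login from repo
df['slug'] = df['slug'].str.replace('_', '/', n=1, regex=False)
df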

Unnamed: 0,slug,readme_text
0,brandonleekramer/diversity,The Rise of Diversity and Population Terminolo...
1,facebook/react,React · \nReact is a JavaScript library for...
2,uva-bi-sdad/oss-2020,UVA-BII Open Source Software 2020-21\nAs of: 0...


We need to write this to the database now... 

Try this: https://medium.com/analytics-vidhya/part-3-5-pandas-dataframe-to-postgresql-using-python-d3bc41fcf39 
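
As a starting point, the write could be as small as a call to pandas `DataFrame.to_sql`. This is a minimal sketch, not a tested pipeline: `write_readmes` and the table name `repo_readmes` are hypothetical, and the example runs against an in-memory SQLite database only so that it can be tried without touching our server.

```python
import sqlite3

import pandas as pd


def write_readmes(df, con, table="repo_readmes", schema=None):
    """Append the slug/readme_text dataframe to a database table.

    `table` is a hypothetical name; `con` must be a SQLAlchemy engine or
    connection, or a sqlite3 connection.
    """
    df.to_sql(table, con=con, schema=schema, if_exists="append", index=False)
    return len(df)


# Try it against an in-memory SQLite database
demo_con = sqlite3.connect(":memory:")
demo_df = pd.DataFrame({"slug": ["facebook/react"], "readme_text": ["React"]})
write_readmes(demo_df, demo_con)
```

Note that `to_sql` does not accept the raw psycopg2 `connection` we opened earlier. For PostgreSQL we would pass an engine built with the `create_engine` function we already import, pointed at the same host and database, and probably `schema='gh'` if the table should sit next to `gh.repos` (an assumption on my part).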
