# DSC 80: Homework 07

### Due Date: Monday, Feb 25 12:00PM

## Instructions
Much like in DSC 10, this Jupyter Notebook contains the statements of the homework problems and provides code and markdown cells to display your answers to the problems. Unlike DSC 10, the notebook is *only* for displaying a readable version of your final answers. The coding work will be developed in an accompanying `hw0X.py` file that will be imported into the current notebook (`X` is the homework number).

Homeworks and programming assignments will be graded in (at most) two ways:
1. The functions and classes in the accompanying python file will be tested (a la DSC 20),
2. The notebook will be graded (for graphs and free response questions).


**Do not change the function names in the `*.py` file**
- The functions in the `*.py` file are how your assignment is graded, and they are graded by their name. The dictionary at the end of the file (`GRADED FUNCTIONS`) contains the "grading list". The final function in the file allows your doctests to check that all the necessary functions exist.
- If you changed something you weren't supposed to, just use git to revert!

**Tips for working in the Notebook**:
- The notebooks serve to present the questions to you and give you a place to present your results for later review.
- The notebooks for *HW assignments* are not graded (only the `.py` file is).
- Notebooks for PAs will serve as a final report for the assignment, and contain conclusions and answers to open ended questions that are graded.
- The notebook serves as a nice environment for 'pre-development' and experimentation before designing your function in your `.py` file.

**Tips for developing in the .py file**:
- Do not change the function names in the starter code; grading is done using these function names.
- Do not change the docstrings in the functions. These are there to tell you if your work is on the right track!
- You are encouraged to write your own additional functions to solve the HW! 
    - Developing in python usually consists of larger files, with many short functions.
    - You may write your other functions in an additional `.py` file that you import in `hw0X.py` (much like we do in the notebook).
- Always document your code!

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import hw07 as hw

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import glob
import os
import time
import re

In [None]:
import requests
import json

# API requests

**Question 1**

You realized that you have a lot of free time and you decided to learn a new language. Since you are not sure which language to choose, you decided to check out the News API: `https://newsapi.org/`. The NewsAPI allows you to make requests to a URL and get a response based on the parameters and values you request. In order to send requests you must click on the `Get API Key` button and fill out the form to access a key. Save this key; you need it to access the API. You will have it saved in your account as well. 

I'd suggest you explore the site (especially the `Get started` page). Once you have an idea of what is required to send requests, you are ready to test things out. Since you need to decide which language to learn, you decide to compare the number of articles in different languages: 

1. Write a function `send_requests` that takes an `apiKey` as a parameter, followed by a variable number of arguments that represent the languages for your requests (use \*args to represent a variable number of arguments; <a href="https://pythontips.com/2013/08/04/args-and-kwargs-in-python-explained/">link for help</a>). This function should return a list of dictionaries, where each dictionary holds a single key-value pair: a language parameter and the corresponding Response object.  

For example, after calling `output = send_requests('625f44103f1c4fdca96844baefad03f5',"ru", "fr") ` the *output* is ```[{'ru': <Response [200]>}, {'fr': <Response [200]>}]```


2. Write a function `gather_info` that takes a list like `output` and returns a list with:
* The language that has the largest number of articles (based on "Total Results").
* The most popular base url (without http) for each language. Each article was taken from a certain source (url).

For languages `ru` and `fr` the output was `['fr', 'lenta.ru', 'www.lequipe.fr']`
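
The "most popular base url" step can be sketched with the standard library alone. This is only an illustration under assumptions: the helper names `base_url` and `most_common_base` are hypothetical, the URLs are made up, and how you pull article URLs out of each response is left to you.

```python
from collections import Counter
from urllib.parse import urlparse

def base_url(url):
    """Return the host part of a URL, e.g. 'lenta.ru' (no scheme, no path)."""
    return urlparse(url).netloc

def most_common_base(urls):
    """Return the base url that appears most often in a list of article URLs."""
    counts = Counter(base_url(u) for u in urls)
    return counts.most_common(1)[0][0]

# Made-up article URLs for illustration only.
sample_urls = [
    "https://lenta.ru/news/1",
    "http://lenta.ru/news/2",
    "https://www.example.com/story",
]
print(most_common_base(sample_urls))
```

Ties are broken by whichever base url `Counter` saw first, which may or may not be what you want.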

**Requirements:**

1. You can only send requests in one function, with one request per language. 
2. Do not compare your results with the given outputs. The news is updated often.
3. You must have at least two helper methods. 
4. Keep your API key hidden. This means that you cannot use it explicitly in the doctests. Instead, set an environment variable to the value of your unique API key (which you get from the newsapi site) and use `os.environ['name of your environment variable']` to obtain the value.
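
Requirements 1 and 4 can be sketched together without touching the network. Everything named here is an assumption for illustration: `NEWS_API_KEY`, `read_api_key`, and `collect_by_language` are hypothetical names, and `fetch` stands in for whatever code actually sends the request.

```python
import os

def read_api_key(var_name="NEWS_API_KEY"):
    """Read the key from an environment variable (the variable name is an assumption)."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key

def collect_by_language(fetch, *languages):
    """Call fetch(language) exactly once per language.

    Returns a list of one-entry dictionaries, [{language: result}, ...],
    mirroring the output shape described for send_requests.
    """
    return [{language: fetch(language)} for language in languages]

# Stand-in for a real request: it only echoes the language code.
def fake_fetch(language):
    return f"response for {language}"

result = collect_by_language(fake_fetch, "ru", "fr")
print(result)
```

Because `*languages` collects the positional arguments into a tuple, the same function works for one language or many.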

# The power of REGEX

**Question 2**

You start with some basic regular expression exercises to get some practice using them. You will find function stubs and related doctests in the starter code. 


Your pattern should catch (in a given text):

**Question 1:** *abc* pattern.

**Question 2:** three digits in a row.

**Question 3:** any string that has "." as the fourth character.

**Question 4:** any string that starts with *c*, *m*, or *f* and is followed by *an*.

**Question 5:** any string that contains *og* and does not start with a *b*.

**Question 6:** words that begin with a capital letter.

**Question 7:** any version of the word *wazup*, as long as more than 1 z is used.

**Question 8:** any string containing, in order, at least one *a*, and at least one *b* or *c*.

**Question 9:** any variation of the string *N file(s) found?*.

**Question 10:** any enumeration followed by a whitespace character followed by any text.

**Question 11:** Your pattern should match strings that begin with the word *Mission:* and end with the word *successful*. Words in the middle are allowed.
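
As a warm-up, here is a small sketch of the building blocks these exercises rely on (anchors, character classes, quantifiers, optional characters). The patterns are deliberately *not* solutions to the exercises above, and the helper name `matches` is hypothetical.

```python
import re

def matches(pattern, text):
    """Return True if the pattern is found anywhere in the text."""
    return re.search(pattern, text) is not None

# Anchors and quantifiers: two capital letters, then one or more digits, nothing else.
print(matches(r"^[A-Z]{2}\d+$", "AB123"))   # the whole string must fit
print(matches(r"^[A-Z]{2}\d+$", "ab123"))   # lowercase letters do not fit

# Optional character: the 'u' may or may not be present.
print(matches(r"colou?r", "colour"))

# Negated character class: a non-vowel followed by 'at'.
print(matches(r"[^aeiou]at", "cat"))
print(matches(r"[^aeiou]at", "eat"))
```

Note that `re.search` looks anywhere in the string, so a pattern must use `^` and `$` when it is meant to describe the whole string.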


## Regex groups: extracting personal information from messy data

**Question 3**

The file in `data/messy.txt` contains personal information from a fictional website that a user scraped from webserver logs. Within this dataset, there are four fields that interest you:
1. Email Addresses (assume they are alphanumeric user-names and domain-names),
2. [Social Security Numbers](https://en.wikipedia.org/wiki/Social_Security_number#Structure)
3. Bitcoin Addresses (alpha-numeric strings of long length)
4. Street Addresses

Create a function `extract_personal` that takes in a string like `open('data/messy.txt').read()` and returns a tuple of four separate lists containing values of the 4 pieces of information listed above (in the order given). Do **not** keep empty values.

*Hint*: There are multiple "delimiters" in use in the file; there are few enough of them that you can safely determine what they are.

*Note:* Since this data is messy/corrupted, your function will be allowed to miss ~5% of the records in each list. Good spot checking using certain useful substrings (e.g. `@` for emails) should help assure correctness! Your function will be tested on a sample of the file `messy.txt`.
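
A minimal sketch of the extraction idea, using `re.findall` on a made-up string. The sample text and its delimiters are invented for illustration and are not the actual format of `messy.txt`; the two patterns are one possible reading of the field descriptions above and will likely need adjusting for the real data.

```python
import re

# Invented sample; the real file uses its own delimiters.
sample = "ssn:123-45-6789|email:alice42@example.com,ssn:987-65-4321|email:bob@example.org"

# Alphanumeric user name, '@', alphanumeric domain name, '.', letters.
email_pattern = r"[A-Za-z0-9]+@[A-Za-z0-9]+\.[A-Za-z]+"
# Three digits, two digits, four digits, separated by hyphens.
ssn_pattern = r"\b\d{3}-\d{2}-\d{4}\b"

emails = re.findall(email_pattern, sample)
ssns = re.findall(ssn_pattern, sample)
print(emails)
print(ssns)
```

Spot checks such as comparing `len(emails)` with `sample.count('@')` are a quick way to see how many records a pattern misses.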

In [None]:
s = open('data/messy.txt').read()

In [None]:
s[:1000]

## Content in Amazon review data

**Question 4**

The dataset `reviews.txt` contains [Amazon reviews](http://jmcauley.ucsd.edu/data/amazon/) for ~200k phones and phone accessories. This dataset has been "cleaned" for you. The goal of this section is to create a function that takes in the review dataset and a review and returns the word that "best summarizes the review" using TF-IDF.

1. Create a function `tfidf_data(review, reviews)` that takes a review as well as the review data and returns a dataframe:
    - indexed by the words in `review`,
    - with columns given by (a) the number of times each word is found in the review (`cnt`), (b) the term frequency for each word (`tf`), (c) the inverse document frequency for each word (`idf`), and (d) the TF-IDF for each word (`tfidf`).
    
2. Create a function `relevant_word(tfidf_data)` which takes in a dataframe as above and returns the word that "best summarizes the review" described by `tfidf_data`.


*Note:* Use this function to "cluster" review types: run it on a sample of reviews and see which words come up most. Unfortunately, you will likely have to change your code from your answer above to run it on the entire dataset. To do this, compute as many of the frequencies as possible "ahead of time" and look them up when needed; you should also likely filter out words that occur "rarely".
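
The four quantities can be sketched on a tiny made-up corpus. This sketch assumes one common definition (term frequency = count of the word divided by the number of words in the review; inverse document frequency = natural log of the number of reviews divided by the number of reviews containing the word), and it assumes the review itself is part of the corpus so no word has a document count of zero. It uses plain dictionaries instead of a dataframe to stay short; the name `tfidf_table` is hypothetical.

```python
import math
from collections import Counter

def tfidf_table(review, reviews):
    """Return {word: {'cnt', 'tf', 'idf', 'tfidf'}} for each word in the review."""
    words = review.split()
    counts = Counter(words)
    # Pre-compute the set of words in every document once.
    doc_word_sets = [set(doc.split()) for doc in reviews]
    table = {}
    for word, cnt in counts.items():
        tf = cnt / len(words)
        docs_with_word = sum(word in doc_words for doc_words in doc_word_sets)
        idf = math.log(len(reviews) / docs_with_word)
        table[word] = {"cnt": cnt, "tf": tf, "idf": idf, "tfidf": tf * idf}
    return table

# Made-up corpus for illustration only.
corpus = ["great phone great case", "bad phone", "great charger"]
table = tfidf_table(corpus[0], corpus)
best_word = max(table, key=lambda w: table[w]["tfidf"])
print(best_word)
```

In this toy corpus the word that appears in only one review ends up with the highest TF-IDF, even though another word occurs more often in the review itself.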

In [None]:
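# Squeeze the single-column DataFrame into a Series
# (the `squeeze=True` argument of read_csv was removed in pandas 2.0).
reviews = pd.read_csv('data/reviews.txt', header=None).squeeze('columns')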
review = open('data/review.txt').read().strip()

## Tweet Analysis: Internet Research Agency

The dataset `data/ira.csv` contains tweets tagged by Twitter as likely being posted by the *Internet Research Agency* (the tweet factory facing allegations of attempting to influence US political elections).

The questions in this section will focus on the following:
1. We will look at the hashtags present in the text and trends in their makeup.
2. We will prepare this dataset for modeling by creating features out of the text fields.

**Question 5 (HashTags)**

* Create a function `hashtag_list` that takes in a column of tweet-text and returns a column containing the list of hashtags present in the tweet text.

* Create a function `most_common_hashtag` that takes in a column of hashtag-lists (the output above) and returns a column consisting of a single hashtag from the tweet-text. 
    - If the text has no hashtags, the entry should be `NaN`,
    - If the text has one distinct hashtag, the entry should contain that hashtag,
    - If the text has more than one hashtag, the entry should be the most common hashtag (among all tweets). If there is a tie for most common, any of the most common can be returned.
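
The two steps can be sketched on plain Python lists. This is an illustration under assumptions: a hashtag is taken to be `#` followed by word characters, the tweets are made up, the helper names are hypothetical, and `None` stands in for `NaN`. The graded functions operate on pandas columns instead.

```python
import re
from collections import Counter

def find_hashtags(text):
    """Return the list of hashtags in one tweet (assumes '#' plus word characters)."""
    return re.findall(r"#\w+", text)

def pick_most_common(tag_lists):
    """For each tweet, pick its hashtag that is most common across all tweets."""
    overall = Counter(tag for tags in tag_lists for tag in tags)
    picked = []
    for tags in tag_lists:
        if not tags:
            picked.append(None)  # stand-in for NaN
        else:
            picked.append(max(tags, key=lambda tag: overall[tag]))
    return picked

# Made-up tweets for illustration only.
tweets = ["#a #b hello", "no tags here", "#b only"]
tag_lists = [find_hashtags(t) for t in tweets]
print(tag_lists)
print(pick_most_common(tag_lists))
```

Counting all hashtags once, before looping over the tweets, avoids recomputing the global frequencies for every row.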


In [2]:
ira = pd.read_csv('data/ira.csv', names=['id', 'name', 'date', 'text'])

**Question 6 (Features)**

Now create a dataframe of features from the `ira` data. That is, create a function `create_features` that takes in the `ira` data and returns a dataframe with the same index as `ira` (i.e. the rows correspond to the same tweets) and the following columns:
* `num_hashtags` gives the number of hashtags present in a tweet,
* `mc_hashtag` gives the most common hashtag associated with a tweet (as given by the problem above),
* `num_tags` gives the number of tags a given tweet has (look for the presence of `@`),
* `num_links` gives the number of hyper-links present in a given tweet,
* A boolean column `is_retweet` that describes if the given tweet is a retweet (i.e. `RT`),
* A 'clean' text field `text` that contains the tweet text with:
    - The non-alphanumeric characters removed (except spaces),
    - The characters all lowercase,
    - All the meta-information above (Retweet info, tags, hyperlinks, hashtags) removed.

*Note:* You should make a helper function for each column.

*Note:* This will take a while to run on the entire dataset -- test it on a small sample first!
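
A few of these per-column helpers can be sketched on a single tweet string. The patterns are assumptions rather than the required definitions: a tag is `@` plus word characters, a link starts with `http://` or `https://`, and a retweet starts with `RT`. The tweet is made up, the helper names are hypothetical, and collapsing repeated spaces in the cleaned text is an extra choice not stated in the problem.

```python
import re

def count_tags(text):
    """Number of @-mentions in the tweet."""
    return len(re.findall(r"@\w+", text))

def count_links(text):
    """Number of hyperlinks in the tweet."""
    return len(re.findall(r"https?://\S+", text))

def is_retweet(text):
    """True if the tweet starts with the retweet marker 'RT'."""
    return re.match(r"RT\b", text) is not None

def clean_text(text):
    """Strip meta-information, then keep lowercase alphanumerics and spaces."""
    # Links are listed before tags/hashtags so a '#' or '@' inside a URL goes with the URL.
    text = re.sub(r"^RT\b|https?://\S+|[@#]\w+", " ", text)
    text = re.sub(r"[^A-Za-z0-9 ]", "", text).lower()
    return " ".join(text.split())  # collapse repeated spaces (an assumption)

# Made-up tweet for illustration only.
tweet = "RT @user: Check this out https://t.co/abc #News Wow!"
print(count_tags(tweet), count_links(tweet), is_retweet(tweet))
print(clean_text(tweet))
```

Each helper works on one string, so it can be applied to the text column one row at a time and tested on a small sample before running on the full dataset.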

***Congratulations, you're done with the homework***

Now, run your doctests and upload hw07.py to GradeScope.