# Core API Quickstart
By [Leon Yin](leonyin.org) for [SMaPP NYU](https://wp.nyu.edu/smapp/)

[urlExpander](https://github.com/SMAPPNYU/urlExpander) is a Python package for quickly and thoroughly expanding URLs.

You can install the software using pip (`pip install urlexpander`). The cell below imports it and prints the installed version:

In [1]:
import os
import urlexpander
from runtimestamp.runtimestamp import runtimestamp
runtimestamp('QuickStart User')
print(f"This notebook is using urlExpander v{urlexpander.__version__}")

Updated 2021-11-05 16:35:16.957041
By QuickStart User
Using Python 3.8.10
On Linux-5.4.0-89-generic-x86_64-with-glibc2.29
This notebook is using urlExpander v0.0.38


Here is a toy example of some URLs taken from Congressional Twitter accounts:

In [2]:
urls = [
    'https://trib.al/xXI5ruM',
    'http://bit.ly/1Sv81cj',
    'https://www.youtube.com/watch?v=8NwKcfXvGl4',
    'https://t.co/zNU1eHhQRn',
]

We can use the `expand` function (see the code) to unshorten any link:

In [3]:
urlexpander.expand(urls[0])

'https://www.breitbart.com/clips/2017/12/31/lindsey-graham-trump-just-cant-tweet-iran/'

It also works on any list of URLs.

In [4]:
urlexpander.expand(urls)

['https://www.breitbart.com/clips/2017/12/31/lindsey-graham-trump-just-cant-tweet-iran/',
 'https://www.hugedomains.com/domain_profile.cfm?d=billshusterforcongress&e=com',
 'https://www.youtube.com/watch?v=8NwKcfXvGl4',
 'https://www.nfib.com/content/press-release/elections/small-business-endorses-shuster-for-reelection-73730/?utm_campaign=Advocacy&utm_source=Twitter&utm_medium=Social']

To save compute time, we can skip links that don't need to be expanded.<br>
The `is_short` function takes any URL and checks whether its domain is on a known list of link shorteners.

In [5]:
print(f"{urls[1]} returns:")
urlexpander.url_utils.is_short(urls[1])

http://bit.ly/1Sv81cj returns:


True

bit.ly is probably the best-known link shortener. youtube.com, however, is not a link shortener!

In [6]:
print(f"{urls[2]} returns:")
urlexpander.url_utils.is_short(urls[2])

https://www.youtube.com/watch?v=8NwKcfXvGl4 returns:


False

urlExpander takes advantage of a list of known domains that offer link shortening services.

In [7]:
known_shorteners = urlexpander.constants.all_short_domains.copy()
print(len(known_shorteners))

86


You can modify this list, or pass your own `list_of_domains` as an argument to the `is_short` function.

In [8]:
known_shorteners += ['youtube.com']

In [9]:
print(f"Now {urls[2]} returns:")
urlexpander.url_utils.is_short(urls[2], list_of_domains=known_shorteners) # pass the modified list instead of the default

Now https://www.youtube.com/watch?v=8NwKcfXvGl4 returns:


True
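
To make the idea behind this check concrete, here is a minimal standard-library sketch of a domain-membership test. It is not urlExpander's implementation: `naive_domain`, `looks_short`, and the three-entry `SHORTENERS` set are hypothetical names used only for illustration, and the domain parsing is deliberately simplistic (it only strips a leading `www.`).

```python
from urllib.parse import urlparse

# Tiny illustrative subset; the package ships a much longer list.
SHORTENERS = {"bit.ly", "t.co", "trib.al"}


def naive_domain(url):
    """Return the lowercased host of `url` without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def looks_short(url, list_of_domains=SHORTENERS):
    """Return True if the URL's host is in `list_of_domains`."""
    return naive_domain(url) in list_of_domains


print(looks_short("http://bit.ly/1Sv81cj"))
print(looks_short("https://www.youtube.com/watch?v=8NwKcfXvGl4"))
print(looks_short("https://www.youtube.com/watch?v=8NwKcfXvGl4",
                  list_of_domains=SHORTENERS | {"youtube.com"}))
```

Passing a different `list_of_domains` changes the answer for the same URL, which mirrors the behavior shown in the cells above.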

Now we can shorten our workload:

In [10]:
# keep only the links that need to be expanded
urls_to_shorten = [link for link in urls if urlexpander.url_utils.is_short(link)]
urls_to_shorten

['https://trib.al/xXI5ruM', 'http://bit.ly/1Sv81cj', 'https://t.co/zNU1eHhQRn']

urlExpander's `expand()` does the heavy lifting to quickly and thoroughly expand a list of links:

In [11]:
expanded_urls = urlexpander.expand(urls_to_shorten)
expanded_urls

['https://www.breitbart.com/clips/2017/12/31/lindsey-graham-trump-just-cant-tweet-iran/',
 'https://www.hugedomains.com/domain_profile.cfm?d=billshusterforcongress&e=com',
 'https://www.nfib.com/content/press-release/elections/small-business-endorses-shuster-for-reelection-73730/?utm_campaign=Advocacy&utm_source=Twitter&utm_medium=Social']

<i>Note that URLs that resolve to defunct pages still return the domain name, followed by the type of error wrapped in double underscores, e.g. `http://www.billshusterforcongress.com/__CONNECTIONPOOL_ERROR__`.</i>
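
If you want to separate failed expansions from successful ones, one option is to look for that trailing marker. The sketch below uses only the standard library; `split_error` and `ERROR_MARKER` are hypothetical names, and the pattern assumes that error names consist of uppercase words joined by single underscores, which is inferred from the one example above rather than from the package's documentation.

```python
import re

# Assumption: the marker looks like __SOME_ERROR__ at the end of the URL.
ERROR_MARKER = re.compile(r"__([A-Z]+(?:_[A-Z]+)*)__/?$")


def split_error(expanded_url):
    """Return (url_without_marker, error_name), or (expanded_url, None)."""
    match = ERROR_MARKER.search(expanded_url)
    if match is None:
        return expanded_url, None
    return expanded_url[:match.start()], match.group(1)


failed = "http://www.billshusterforcongress.com/__CONNECTIONPOOL_ERROR__"
ok = "https://www.youtube.com/watch?v=8NwKcfXvGl4"
print(split_error(failed))
print(split_error(ok))
```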

Instead of filtering the inputs before running the `expand` function, you can assign a filter using the `filter_function` argument.<br>
Filter functions can be any boolean function that operates on a string. Below is an example function that filters for t.co links:

In [12]:
def custom_filter(url):
    '''This function returns True if the url is a shortened Twitter URL'''
    return urlexpander.url_utils.get_domain(url) == 't.co'

In [13]:
resolved_links = urlexpander.expand(urls, 
                                    filter_function=custom_filter, 
                                    verbose=1)
resolved_links

There are 1 URLs to expand


100%|██████████| 1/1 [00:12<00:00, 12.97s/it]


['https://trib.al/xXI5ruM',
 'http://bit.ly/1Sv81cj',
 'https://www.youtube.com/watch?v=8NwKcfXvGl4',
 'https://www.nfib.com/content/press-release/elections/small-business-endorses-shuster-for-reelection-73730/?utm_campaign=Advocacy&utm_source=Twitter&utm_medium=Social']
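
The output above suggests the semantics of a filter: links that pass the filter are expanded, the rest are returned unchanged, and the input order is preserved. Here is a minimal sketch of that behavior under those assumptions. `expand_where` and `fake_expand` are hypothetical stand-ins (no network requests are made), not part of urlExpander.

```python
def expand_where(urls, expand_one, filter_function=None):
    """Apply `expand_one` only to URLs accepted by `filter_function`.

    URLs rejected by the filter are passed through untouched, so the
    result has the same length and order as the input.
    """
    return [
        expand_one(url) if filter_function is None or filter_function(url) else url
        for url in urls
    ]


def fake_expand(url):
    """Stand-in for a real expansion; just tags the URL."""
    return "expanded:" + url


def is_tco(url):
    """Toy filter: accept only t.co links."""
    return url.startswith("https://t.co/")


demo = ["https://trib.al/xXI5ruM", "https://t.co/zNU1eHhQRn"]
print(expand_where(demo, fake_expand, filter_function=is_tco))
```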

Although filtering within the `expand` function is convenient, run time still grows with the number of links that pass the filter: the t.co filter above expanded 1 link, while `is_short` below expands 3.

In [14]:
resolved_links = urlexpander.expand(urls,  
                                    filter_function=urlexpander.url_utils.is_short,
                                    verbose=1)
resolved_links

There are 3 URLs to expand


100%|██████████| 1/1 [00:29<00:00, 29.53s/it]


['https://www.breitbart.com/clips/2017/12/31/lindsey-graham-trump-just-cant-tweet-iran/',
 'https://www.hugedomains.com/domain_profile.cfm?d=billshusterforcongress&e=com',
 'https://www.youtube.com/watch?v=8NwKcfXvGl4',
 'https://www.nfib.com/content/press-release/elections/small-business-endorses-shuster-for-reelection-73730/?utm_campaign=Advocacy&utm_source=Twitter&utm_medium=Social']

<hr>

But that is a toy example. Let's see how this fares with a larger dataset.<br>
This package comes with a [sampled dataset](https://github.com/SMAPPNYU/urlExpander/blob/master/urlexpander/core/datasets.py#L8-L29) of links extracted from Twitter accounts from the 115th Congress. <br>
If you work with Twitter data, you'll be glad to know there is a function `urlexpander.tweet_utils.get_link` for creating a similar dataset from Tweets.

In [15]:
df_congress = urlexpander.datasets.load_congress_twitter_links(nrows=100)

print(f'The dataset has {len(df_congress)} rows')
df_congress.tail(2)

The dataset has 100 rows


| | link_domain | link_url_long | link_url_short | tweet_created_at | tweet_id | tweet_text | user_id |
|---|---|---|---|---|---|---|---|
| 98 | on.fb.me | http://on.fb.me/pt4GTD | http://t.co/6gG88Kb | Tue Aug 02 23:01:13 +0000 2011 | 98528808268857344 | Hope you can make it to my town hall in #Modes... | 248699486 |
| 99 | facebook.com | https://www.facebook.com/HouseSmallBizDemocrats/ | https://t.co/FZgTIq9l2N | Thu Jul 06 14:00:03 +0000 2017 | 882962350256513025 | Follow Small Business Dems on Facebook to lear... | 164369297 |


In [16]:
shortened_urls = df_congress[df_congress.link_domain.apply(urlexpander.url_utils.is_short)].tweet_id.nunique()
all_urls = df_congress.tweet_id.nunique()
print(f"About {(shortened_urls / all_urls)*100}% of the links are short!")

About 38.0% of the links are short!


The performance of the next script is dependent on your internet connection:

In [17]:
# !curl -s https://raw.githubusercontent.com/sivel/speedtest-cli/master/speedtest.py | python -

Let's see how long it takes to expand these links.<br>

This is where the optional parameters for `expand` shine.
We can create multiple threads for requests (using `n_workers`), cache results in a JSON file (`cache_file`), and chunk the input into smaller pieces (using `chunksize`). Why does this last part matter? When expanding links en masse, I noticed that performance degrades over time. Chunking the input prevents this from happening (not sure why, though)!

In [18]:
resolved_links = urlexpander.expand(df_congress['link_url_long'], 
                                    chunksize=1280,
                                    n_workers=8, 
                                    cache_file=os.path.join('output', 'temp_cache.json'), 
                                    verbose=1,
                                    filter_function=urlexpander.url_utils.is_short)

There are 38 URLs to expand


100%|██████████| 1/1 [00:59<00:00, 59.38s/it]
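
To illustrate how chunking, worker threads, and a JSON cache can fit together, here is a minimal standard-library sketch. It is not how urlExpander implements `expand`: `chunked`, `expand_in_chunks`, and `fake_resolve` are hypothetical names, and the resolver is a stand-in that makes no network requests.

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Yield successive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def expand_in_chunks(urls, resolve, chunksize=2, n_workers=4, cache_file=None):
    """Resolve `urls` chunk by chunk with a thread pool and an optional JSON cache."""
    cache = {}
    if cache_file and os.path.exists(cache_file):
        with open(cache_file) as f:
            cache = json.load(f)

    for chunk in chunked(urls, chunksize):
        # De-duplicate within the chunk and skip anything already cached.
        todo = [u for u in dict.fromkeys(chunk) if u not in cache]
        with ThreadPoolExecutor(max_workers=n_workers) as pool:
            for url, result in zip(todo, pool.map(resolve, todo)):
                cache[url] = result
        # Persist after every chunk so an interrupted run loses little work.
        if cache_file:
            with open(cache_file, "w") as f:
                json.dump(cache, f)

    return [cache[u] for u in urls]


def fake_resolve(url):
    """Stand-in resolver: rewrites the host instead of following redirects."""
    return url.replace("sho.rt", "example.com")


demo_urls = ["https://sho.rt/a", "https://sho.rt/b", "https://sho.rt/a"]
with tempfile.TemporaryDirectory() as tmp:
    demo_cache = os.path.join(tmp, "cache.json")
    demo_result = expand_in_chunks(demo_urls, fake_resolve, cache_file=demo_cache)
print(demo_result)
```

Writing the cache after each chunk is one possible design choice: a long run that fails partway through can resume from the last completed chunk.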


At SMaPP, the process of link expansion has been a burden on our research.<br>
We hope that this software helps you overcome similar obstacles!

In [19]:
df_congress['expanded_url'] = resolved_links
df_congress['resolved_domain'] = df_congress['expanded_url'].apply(urlexpander.url_utils.get_domain)
df_congress.tail(2)

| | link_domain | link_url_long | link_url_short | tweet_created_at | tweet_id | tweet_text | user_id | expanded_url | resolved_domain |
|---|---|---|---|---|---|---|---|---|---|
| 98 | on.fb.me | http://on.fb.me/pt4GTD | http://t.co/6gG88Kb | Tue Aug 02 23:01:13 +0000 2011 | 98528808268857344 | Hope you can make it to my town hall in #Modes... | 248699486 | https://www.facebook.com/RepJeffDenham/posts/1... | facebook.com |
| 99 | facebook.com | https://www.facebook.com/HouseSmallBizDemocrats/ | https://t.co/FZgTIq9l2N | Thu Jul 06 14:00:03 +0000 2017 | 882962350256513025 | Follow Small Business Dems on Facebook to lear... | 164369297 | https://www.facebook.com/HouseSmallBizDemocrats/ | facebook.com |


Here are the top 25 shared domains from this sampled Congress dataset:

In [20]:
df_congress.resolved_domain.value_counts().head(25)

twitter.com            15
facebook.com           11
house.gov               4
senate.gov              3
politico.com            3
youtube.com             2
wapa.tv                 2
google.com              2
healthcare.gov          2
usa.gov                 2
nytimes.com             2
instagram.com           2
cityofhesperia.us       1
khanacademy.org         1
wh.gov                  1
billkeating.org         1
huffpost.com            1
mass.gov                1
foxnews.com             1
sarahpalin-blog.com     1
vaildaily.com           1
newyorker.com           1
kokefm.com              1
thinkprogress.org       1
thehill.com             1
Name: resolved_domain, dtype: int64

<hr>

# Bonus Round!
You can count the number of `resolved_domain`s for each `user_id` using `count_matrix()`.<br>
You can even choose which domains are counted by modifying the `domain_list` arg:

In [21]:
count_matrix = urlexpander.tweet_utils.count_matrix(df_congress,
                                                    user_col='user_id', 
                                                    domain_col='resolved_domain', 
                                                    unique_count_col='tweet_id',
                                                    domain_list=['youtube.com','facebook.com', 'google.com', 'twitter.com'])

count_matrix.tail(3)

| user_id | facebook.com | youtube.com | twitter.com | google.com |
|---|---|---|---|---|
| 3026622545 | 1 | 0 | 0 | 0 |
| 817050219007328258 | 0 | 0 | 1 | 0 |
| 828977216595849216 | 0 | 0 | 1 | 0 |
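
For intuition about what such a matrix contains, here is a minimal sketch that counts distinct tweets per user and domain using only the standard library. It is an assumption about the general idea, not the package's implementation: `count_unique` and the toy `rows` are hypothetical, and users with no links in `domain_list` are simply omitted here.

```python
from collections import defaultdict


def count_unique(rows, user_col, domain_col, unique_col, domain_list=None):
    """Count distinct `unique_col` values for each (user, domain) pair."""
    seen = defaultdict(set)
    for row in rows:
        domain = row[domain_col]
        if domain_list is not None and domain not in domain_list:
            continue
        seen[(row[user_col], domain)].add(row[unique_col])

    users = sorted({user for user, _ in seen})
    if domain_list is not None:
        domains = list(domain_list)
    else:
        domains = sorted({domain for _, domain in seen})
    # One row per user, one column per domain, zero where nothing was shared.
    return {
        user: {d: len(seen.get((user, d), ())) for d in domains}
        for user in users
    }


rows = [
    {"user_id": 1, "resolved_domain": "youtube.com", "tweet_id": 10},
    {"user_id": 1, "resolved_domain": "youtube.com", "tweet_id": 10},  # same tweet twice
    {"user_id": 1, "resolved_domain": "twitter.com", "tweet_id": 11},
    {"user_id": 2, "resolved_domain": "nytimes.com", "tweet_id": 12},
    {"user_id": 2, "resolved_domain": "twitter.com", "tweet_id": 13},
]
matrix = count_unique(rows, "user_id", "resolved_domain", "tweet_id",
                      domain_list=["youtube.com", "twitter.com"])
print(matrix)
```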


One domain list you might be interested in is the list of US national media outlets,
available via `datasets.load_us_national_media_outlets()` and compiled by Gregory Eady (forthcoming).

In [22]:
urlexpander.datasets.load_us_national_media_outlets()[:5]

array(['abcnews.go.com', 'aim.org', 'alternet.org',
       'theamericanconservative.com', 'prospect.org'], dtype=object)

<hr>
We also built a one-size-fits-all scraper that returns the title, description, and/or paragraphs from any given URL.

In [23]:
url = urls[0]
example = urlexpander.expand_with_content(url)
html = example['resolved_text']

In [24]:
urlexpander.html_utils.search_webpage_title(html)

"Lindsey Graham to Trump: 'You Just Can't Tweet' About Iran"

In [25]:
urlexpander.html_utils.search_webpage_description(html)

'Sunday CBS\'s "Face the Nation," while discussing the last several\xa0days of protests in Iran over\xa0government corruption, Sen. Lindsey Graham (R-SC) warned | Clips'

In [26]:
urlexpander.html_utils.search_webpage_meta(url, html)

{'url': 'https://trib.al/xXI5ruM',
 'title': "Lindsey Graham to Trump: 'You Just Can't Tweet' About Iran",
 'description': 'Sunday CBS\'s "Face the Nation," while discussing the last several\xa0days of protests in Iran over\xa0government corruption, Sen. Lindsey Graham (R-SC) warned | Clips',
 'image_url': 'https://media.breitbart.com/media/2017/12/Lindsey-Graham.jpg'}
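
As a rough illustration of what extracting a title and description from raw HTML involves, here is a minimal sketch built on the standard library's `html.parser`. It is not the scraper shipped with urlExpander: `MetaExtractor` and the sample HTML string are hypothetical, and only the `<title>` tag and the `<meta name="description">` tag are handled.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect the <title> text and the meta description of a page."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.description = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several pieces, so accumulate it.
        if self._in_title:
            self.title = (self.title or "") + data


sample_html = (
    "<html><head><title>Example page</title>"
    '<meta name="description" content="A short summary.">'
    "</head><body><p>Hello</p></body></html>"
)
extractor = MetaExtractor()
extractor.feed(sample_html)
print(extractor.title)
print(extractor.description)
```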

## Conclusion
Thanks for stumbling upon this package. We hope that it will lead to more research around links.<br>
We're working on some projects in this vein and would love to know if you are too!

urlExpander is open source, so please feel free to reach out about bugs, feature requests, or collaboration!