# U.S. Census Bureau - Predictive Analysis


By Savahnna L. Cunningham

MSDA - Data Science Tools & Techniques

Date: September 15, 2018

### Introduction

This project uses two programming languages, Python and R, to scrape and wrangle "U.S. Census Bureau" data and to run a linear regression analysis that predicts the size of the population of your state in 2020.

### Part 1: Python 

##### Web Scraping

Develop a web link scraper in Python that extracts all of the unique web links pointing to other web pages from the HTML code of the “Current Estimates” page, both within the “US Census Bureau” website and outside that domain, and writes them to a comma-separated values (CSV) file as absolute uniform resource identifiers (URIs).
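
The task separates links inside the Census Bureau site from links outside that domain. As a minimal sketch of how that distinction could be checked, the helper below compares a URL's host name against `census.gov`. The function name `is_census_link` and its `domain` parameter are hypothetical and are not part of the notebook's own code.

```python
from urllib.parse import urlparse


def is_census_link(url, domain="census.gov"):
    """Return True when the URL's host is the domain itself or one of its subdomains.

    Hypothetical helper for illustration only.
    """
    # hostname is None for relative links, so fall back to an empty string
    host = (urlparse(url).hostname or "").lower()
    return host == domain or host.endswith("." + domain)


print(is_census_link("https://www.census.gov/data.html"))
print(is_census_link("https://www.commerce.gov"))
```

Matching on the host name rather than on a substring of the whole URL avoids counting an outside page as internal just because `census.gov` appears somewhere in its path.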

In [1]:
#import libraries needed for data wrangling
import pandas as pd
import requests
from bs4 import BeautifulSoup, SoupStrainer
import lxml
import csv
import re

In [2]:
#"Current Estimates" web link
url = "https://www.census.gov/programs-surveys/popest.html"
r = requests.get(url)
r.raise_for_status()  #stop early if the request failed
raw_html = r.text

#Pass the webpage html into BS4
soup = BeautifulSoup(raw_html, 'lxml')

#Save html at time of query
with open('census_html.txt', 'w', encoding='utf-8') as f:
    f.write(soup.prettify())

In [3]:
#Collect all <a> tags that have an href attribute and non-empty text
tags = soup.find_all(lambda tag: tag.name == 'a' and tag.get('href') and tag.text)
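
To show what the filter above is doing, here is a rough, dependency-free sketch of the same idea using the standard library's `html.parser` in place of BeautifulSoup. It is a swapped-in technique for illustration, not the notebook's method, and the class name `LinkCollector` is hypothetical. One assumed difference: it ignores anchors whose text is only whitespace, which the lambda above would keep.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that contain visible text (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.hrefs = set()   # a set keeps each href only once
        self._href = None    # href of the <a> tag currently open, if any
        self._text = []      # text chunks seen inside that <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only record text while inside an <a> tag that has an href
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a":
            # Keep the link only if it has a non-empty href and some visible text
            if self._href and "".join(self._text).strip():
                self.hrefs.add(self._href)
            self._href = None


collector = LinkCollector()
collector.feed('<a href="/about.html">About</a><a href="/empty.html"></a>')
print(collector.hrefs)
```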

In [11]:
#find all web links & add to set to eliminate duplicate records
links = set()
for item in tags:
    links.add(item.get('href'))
links

{'#',
 '#NAV_1082186784_0_accd',
 '#NAV_1207455997_0_accd',
 '#NAV_1529793603_0_accd',
 '#NAV_1713211272_0_accd',
 '#NAV_1965438184_0_accd',
 '#NAV_2133229556_0_accd',
 '#NAV_2137395588_0_accd',
 '#NAV_246571490_0_accd',
 '#NAV_30093459_0_accd',
 '#NAV_437121255_0_accd',
 '#NAV_665794506_0_accd',
 '#NAV_820029714_0_accd',
 '#NAV_893921976_0_accd',
 '#https://www.census.gov/about.html',
 '#https://www.census.gov/data.html',
 '#https://www.census.gov/geography.html',
 '#https://www.census.gov/library.html',
 '#https://www.census.gov/newsroom.html',
 '#https://www.census.gov/programs-surveys.html',
 '#skipNav',
 '#skipNav-CMSContent',
 '#skipfooter',
 '#skipsideNav',
 '#skiptoplinks',
 '/about.html',
 '/data.html',
 '/data/tables/2017/demo/popest/nation-total.html',
 '/data/tables/2017/demo/popest/total-housing-units.html',
 '/data/tables/2017/demo/popest/total-puerto-rico-municipios.html',
 '/en.html',
 '/library/publications/2010/demo/p25-1138.html',
 '/library/publications/2010/demo/p2

In [5]:
#remove all same-page (fragment) links, which start with "#", from dataset
links = {item for item in links if not item.startswith('#')}

#strip the "/" at the end of links to prevent duplication
links = {item.rstrip('/') for item in links if item.rstrip('/')}
links

{'/about.html',
 '/data.html',
 '/data/tables/2017/demo/popest/nation-total.html',
 '/data/tables/2017/demo/popest/total-housing-units.html',
 '/data/tables/2017/demo/popest/total-puerto-rico-municipios.html',
 '/en.html',
 '/library/publications/2010/demo/p25-1138.html',
 '/library/publications/2010/demo/p25-1139.html',
 '/library/publications/2015/demo/p25-1142.html',
 '/library/visualizations/2018/comm/july4.html',
 '/library/visualizations/2018/comm/midwest-counties.html',
 '/library/visualizations/2018/comm/youngest-oldest-counties.html',
 '/newsroom.html',
 '/newsroom/press-releases/2018/estimates-characteristics.html',
 '/newsroom/press-releases/2018/popest-characteristics.html',
 '/newsroom/press-releases/2018/popest-characteristics/popest-characteristics-sp.html',
 '/programs-surveys.html',
 '/programs-surveys/popest/about.html',
 '/programs-surveys/popest/about/challenge-program.html',
 '/programs-surveys/popest/about/faq.html',
 '/programs-surveys/popest/about/fscpe.html',
 

In [6]:
#relative links are converted to absolute URLs
url_update = []
for link in links:
    if link.startswith('/'):
        url_update.append('https://www.census.gov' + link)
    else:
        url_update.append(link)

url_update

['https://www.census.gov/topics/population/population-estimates.html',
 'https://www.census.gov/topics/public-sector/taxes.html',
 'https://www.census.gov/data/tables/2017/demo/popest/total-puerto-rico-municipios.html',
 'https://www.census.gov/topics/population/genealogy.html',
 'https://www.census.gov/programs-surveys/popest/library.html',
 'https://www.census.gov/topics/families/data.html',
 'https://www.census.gov/topics/housing/housing-vacancies.html',
 'https://www.census.gov/en.html',
 'https://www.census.gov/topics/education/about.html',
 'https://www.census.gov/about/regions/about.html',
 'https://www.census.gov/topics/employment/publications.html',
 'https://www.census.gov/programs-surveys/popest/data/errata-notes.html',
 'https://www.census.gov/topics/income-poverty/income-inequality.html',
 'https://www.census.gov/geography/interactive-maps.html',
 'https://www.census.gov/topics/international-trade/trade-regulations.html',
 'https://www.census.gov/programs-surveys.html',
 '
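
Prefixing the domain by hand covers root-relative paths such as `/about.html`, but not other forms an `href` can take. As a hedged alternative, the sketch below uses `urljoin` and `urldefrag` from the standard library's `urllib.parse` module. The function name `to_absolute` and the sample values are hypothetical; whether the scraped page actually contains protocol-relative or `mailto:` links is an assumption.

```python
from urllib.parse import urljoin, urldefrag, urlparse

BASE_URL = "https://www.census.gov/programs-surveys/popest.html"


def to_absolute(hrefs, base_url=BASE_URL):
    """Return a sorted list of unique absolute http(s) URLs built from raw href values."""
    absolute = set()
    for href in hrefs:
        # Resolve relative paths ("/about.html") and protocol-relative links ("//host/x")
        joined = urljoin(base_url, href.strip())
        # Drop the "#fragment" part so same-page anchors collapse onto the page URL
        without_fragment, _ = urldefrag(joined)
        # Keep only web links; this skips mailto:, javascript: and similar schemes
        if urlparse(without_fragment).scheme in ("http", "https"):
            absolute.add(without_fragment.rstrip("/"))
    # Fragment-only links resolve to the page itself, so exclude the page's own URL
    absolute.discard(base_url.rstrip("/"))
    return sorted(absolute)


sample = ["#skipNav", "/about.html", "/about.html", "https://www.usa.gov/"]
print(to_absolute(sample))
```

Sorting the result also gives the CSV a stable order, since iterating over a set does not guarantee one.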

In [7]:
# Save to csv file, one URL per row
with open("census_url.csv", "w", newline="") as f:
    wr = csv.writer(f)
    wr.writerows([link] for link in url_update)
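
One way to confirm that the file holds one unique URI per row is to read it back and compare it with what was written. The sketch below does this round trip against an in-memory buffer so that it runs without touching the disk. The helper names `write_urls` and `read_urls` and the `url` header row are assumptions added for illustration; the notebook's own file has no header.

```python
import csv
import io


def write_urls(urls, fileobj):
    """Write one URL per row under a single 'url' header (header is an assumption)."""
    writer = csv.writer(fileobj)
    writer.writerow(["url"])
    writer.writerows([u] for u in urls)


def read_urls(fileobj):
    """Read the URLs back, skipping the header row and any blank rows."""
    reader = csv.reader(fileobj)
    next(reader, None)
    return [row[0] for row in reader if row]


urls = ["https://www.census.gov/about.html", "https://www.census.gov/data.html"]
buffer = io.StringIO(newline="")  # stands in for open("census_url.csv", "w", newline="")
write_urls(urls, buffer)
buffer.seek(0)
restored = read_urls(buffer)
print(restored == urls, len(restored) == len(set(restored)))
```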

### Resources 

Python Tutorial: Web Scraping with BeautifulSoup and Requests 

https://www.youtube.com/watch?v=ng2o98k983k

Attribute Tags

https://stackoverflow.com/questions/43814754/python-beautifulsoup-how-to-get-href-attribute-of-a-element/43815538

Set() function and duplicate links

https://docs.python.org/2/library/sets.html

https://www.python-course.eu/sets_frozensets.php

https://www.programiz.com/python-programming/set

https://www.quora.com/What-is-an-elegant-way-to-iterate-over-a-list-removing-elements-as-you-go-in-Python

General Web Scraping Info

https://github.com/lorien/awesome-web-scraping/blob/master/python.md#url-and-network-address-manipulation

https://www.reddit.com/r/learnpython/comments/2mmphx/saving_beautifulsoup_output_to_txt_file/
