# Scraping Neighborhood Descriptions from Wikipedia Using the BeautifulSoup Python Library

Python version: 3  
BeautifulSoup version: beautifulsoup4-4.6.0  
Installation: `pip install beautifulsoup4 lxml` (the `lxml` parser used below is a separate package)

In [1]:
from bs4 import BeautifulSoup

## 1. Test with one URL

In [2]:
names = ["Brooklyn_Heights", "Hell%27s_Kitchen,_Manhattan", "Chelsea,_Manhattan"]
url_1 = "https://en.wikipedia.org/wiki/" + names[0]
print(url_1)

https://en.wikipedia.org/wiki/Brooklyn_Heights
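
The second entry in `names` is already percent-encoded (`Hell%27s_Kitchen`). When titles arrive unencoded, one option is to let `urllib.parse.quote` do the escaping. The sketch below is only an illustration: `wiki_url` is a hypothetical helper that is not part of this notebook, and it assumes its input is a raw title (an already-encoded title would be encoded twice).

```python
from urllib.parse import quote

WIKI_BASE = "https://en.wikipedia.org/wiki/"

def wiki_url(title):
    """Build an article URL from a raw (not yet percent-encoded) title."""
    slug = title.replace(" ", "_")
    # Keep commas and parentheses readable, as in the URLs used in this notebook.
    return WIKI_BASE + quote(slug, safe=",()")

print(wiki_url("Hell's_Kitchen,_Manhattan"))
# https://en.wikipedia.org/wiki/Hell%27s_Kitchen,_Manhattan
```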


In [3]:
from urllib.request import urlopen

url = url_1
content = urlopen(url).read()
soup = BeautifulSoup(content, "lxml")
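
A bare `urlopen(url).read()` raises as soon as one page is missing or the connection drops, which matters once many pages are fetched in a loop. The following is a minimal sketch of a more defensive fetch, not the method used in this notebook: `fetch_html` and the `USER_AGENT` string are hypothetical names, and the 10-second timeout is an arbitrary choice. Some sites may also reject requests that lack a descriptive User-Agent, so one is set here.

```python
from urllib.request import Request, urlopen

# Placeholder identifier; replace with something that describes your own script.
USER_AGENT = "neighborhood-scraper-example/0.1"

def fetch_html(url, timeout=10):
    """Return the raw bytes at `url`, or None if the request fails."""
    request = Request(url, headers={"User-Agent": USER_AGENT})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.read()
    except OSError as error:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        print("Could not fetch", url, "-", error)
        return None

# Usage (needs network access):
# content = fetch_html(url_1)
# if content is not None:
#     soup = BeautifulSoup(content, "lxml")
```

Returning `None` lets the caller decide whether to skip the page or retry.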

In [4]:
soup.title

<title>Brooklyn Heights - Wikipedia</title>

In [5]:
print(soup.p.get_text())

Brooklyn Heights is an affluent residential neighborhood within the New York City borough of Brooklyn. Originally referred to as Brooklyn Village, it has been a prominent area of Brooklyn since 1834. The neighborhood is noted for its low-rise architecture and its many brownstone rowhouses, most of them built prior to the Civil War. It also has an abundance of notable churches and other religious institutions. Brooklyn's first art gallery, the Brooklyn Arts Gallery, was opened in Brooklyn Heights in 1958.[4] In 1965, a large part of Brooklyn Heights was protected from unchecked development by the creation of the Brooklyn Heights Historic District, the first such district in New York City. The district was added to the National Register of Historic Places in 1966.
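
The extracted text still carries Wikipedia's reference markers, such as `[4]` above. If they are not wanted in the final descriptions, a regular expression can drop them. This is a small sketch with a hypothetical helper name, `strip_citations`; it only handles purely numeric markers and leaves other bracketed notes alone.

```python
import re

# Matches numeric reference markers like [4] or [12].
CITATION_MARKER = re.compile(r"\[\d+\]")

def strip_citations(text):
    """Remove numeric reference markers such as [4] from scraped text."""
    return CITATION_MARKER.sub("", text)

sample = "was opened in Brooklyn Heights in 1958.[4] In 1965, a large part"
print(strip_citations(sample))
# was opened in Brooklyn Heights in 1958. In 1965, a large part
```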


## 2. Scrape 175 NYC zip code descriptions

In [6]:
import pandas as pd
import numpy as np

In [13]:
df = pd.read_csv("neighborhoods.csv", encoding="ISO-8859-1")
df.head()

Unnamed: 0,ZIP,BORO,NEIGHBORHOOD,NAME,WIKILINK
0,10451,Bronx,High Bridge and Morrisania,Mott Haven,"Mott_Haven,_Bronx"
1,10452,Bronx,High Bridge and Morrisania,Highbridge,"Highbridge,_Bronx"
2,10453,Bronx,Central Bronx,Morris Heights,"Morris_Heights,_Bronx"
3,10454,Bronx,Hunts Point and Mott Haven,Mott Haven,"Mott_Haven,_Bronx"
4,10455,Bronx,Hunts Point and Mott Haven,Mott Haven,"Mott_Haven,_Bronx"


In [14]:
zipcode = df['ZIP'].tolist()
names = df['NAME'].tolist()
neighbors = df['WIKILINK'].tolist()
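
The `WIKILINK` column repeats itself: `Mott_Haven,_Bronx` alone covers zip codes 10451, 10454, and 10455. Fetching each distinct link once would avoid redundant requests. The sketch below shows one way to do that; `describe_unique` is a hypothetical helper, and `fake_describe` is a stand-in for the real fetch-and-parse step so the example runs offline. The optional delay between requests is an assumption about polite scraping, not something this notebook does.

```python
import time

def describe_unique(links, describe, delay_seconds=0.0):
    """Call `describe` once per distinct link and return {link: description}."""
    results = {}
    # dict.fromkeys keeps the first occurrence of each link, in order.
    for link in dict.fromkeys(links):
        results[link] = describe(link)
        if delay_seconds:
            time.sleep(delay_seconds)  # pause between requests
    return results

# Stand-in for the real fetch-and-parse step (hypothetical).
calls = []
def fake_describe(link):
    calls.append(link)
    return "description of " + link

sample = ["Mott_Haven,_Bronx", "Highbridge,_Bronx", "Morris_Heights,_Bronx",
          "Mott_Haven,_Bronx", "Mott_Haven,_Bronx"]
by_link = describe_unique(sample, fake_describe)
print(len(sample), "rows,", len(calls), "fetches")
# 5 rows, 3 fetches
```

Keying the result by link rather than by name also avoids collisions if two rows share a name but point to different pages.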

In [30]:
neighbors_desc = {}

# Fetch each neighborhood's Wikipedia page and keep its first paragraph
for name, link in zip(names, neighbors):
    url = "https://en.wikipedia.org/wiki/" + link
    print(url)
    content = urlopen(url).read()
    soup = BeautifulSoup(content, "lxml")

    first_p = soup.find("p")
    if "Coordinates:" in first_p.get_text():  # some pages start with coordinates info instead of description
        description = first_p.find_next("p").get_text()
    else:
        description = first_p.get_text()

    neighbors_desc[name] = description
    

https://en.wikipedia.org/wiki/Mott_Haven,_Bronx
https://en.wikipedia.org/wiki/Highbridge,_Bronx
https://en.wikipedia.org/wiki/Morris_Heights,_Bronx
https://en.wikipedia.org/wiki/Mott_Haven,_Bronx
https://en.wikipedia.org/wiki/Mott_Haven,_Bronx
https://en.wikipedia.org/wiki/Morrisania,_Bronx
https://en.wikipedia.org/wiki/Tremont,_Bronx
https://en.wikipedia.org/wiki/Fordham,_Bronx
https://en.wikipedia.org/wiki/Hunts_Point,_Bronx
https://en.wikipedia.org/wiki/West_Farms,_Bronx
https://en.wikipedia.org/wiki/Morris_Park,_Bronx
https://en.wikipedia.org/wiki/Parkchester,_Bronx
https://en.wikipedia.org/wiki/Kingsbridge,_Bronx
https://en.wikipedia.org/wiki/Pelham_Bay_(neighborhood),_Bronx
https://en.wikipedia.org/wiki/Throggs_Neck
https://en.wikipedia.org/wiki/Wakefield,_Bronx
https://en.wikipedia.org/wiki/Norwood,_Bronx
https://en.wikipedia.org/wiki/Fordham,_Bronx
https://en.wikipedia.org/wiki/Pelham_Gardens,_Bronx
https://en.wikipedia.org/wiki/Woodlawn,_Bronx
https://en.wikipedia.org/wiki/Riv

https://en.wikipedia.org/wiki/Midwood,_Brooklyn
https://en.wikipedia.org/wiki/Red_Hook,_Brooklyn
https://en.wikipedia.org/wiki/Sunset_Park,_Brooklyn
https://en.wikipedia.org/wiki/Bedford_Stuyvesant,_Brooklyn
https://en.wikipedia.org/wiki/Mill_Basin,_Brooklyn
https://en.wikipedia.org/wiki/Brighton_Beach
https://en.wikipedia.org/wiki/Canarsie,_Brooklyn
https://en.wikipedia.org/wiki/Bushwick,_Brooklyn
https://en.wikipedia.org/wiki/Prospect_Heights,_Brooklyn
https://en.wikipedia.org/wiki/Starrett_City,_Brooklyn
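
The `Coordinates:` check covers one kind of non-description paragraph, but the scraped results in this notebook also include fragments such as `718, 347, 646` and hatnotes beginning "Not to be confused with". A somewhat broader filter is sketched below. It is a heuristic under stated assumptions, not a replacement for spot checks: `first_description`, the prefix list, and the 60-character minimum are all hypothetical choices made for illustration.

```python
# Assumed markers of non-description paragraphs; extend as needed.
SKIP_PREFIXES = ("Coordinates:", "Not to be confused with")
MIN_LENGTH = 60  # arbitrary cutoff for "too short to be a description"

def first_description(paragraphs):
    """Return the first paragraph that looks like prose, or None."""
    for text in paragraphs:
        text = text.strip()
        if len(text) < MIN_LENGTH:
            continue  # empty paragraphs, area codes, stray fragments
        if text.startswith(SKIP_PREFIXES):
            continue  # coordinates lines and hatnotes
        return text
    return None

# Made-up paragraphs, used only to exercise the function offline.
sample = [
    "Coordinates: 40.0°N 73.0°W",
    "718, 347, 646",
    "Not to be confused with Example Point, a different made-up place used only for this test.",
    "Example Heights is a made-up neighborhood used only to exercise this sketch offline.",
]
print(first_description(sample))
# Example Heights is a made-up neighborhood used only to exercise this sketch offline.
```

With a parsed page, the candidate paragraphs could come from `(p.get_text() for p in soup.find_all("p"))`.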


In [16]:
# Spot-check the first 11 scraped descriptions
for i, (k, v) in enumerate(neighbors_desc.items()):
    if i > 10:
        break
    print("Neighborhood: ", k)
    print(v)
    print("")

Neighborhood:  Park Slope
Park Slope is a neighborhood in northwest Brooklyn, New York City. Park Slope is roughly bounded by Prospect Park and Prospect Park West to the east, Fourth Avenue to the west, Flatbush Avenue to the north, and Prospect Expressway to the south. Generally, the section from Flatbush Avenue to Garfield Place (the "named streets") is considered the "North Slope", the section from 1st through 9th Streets is considered the "Center Slope", and south of 10th Street, the "South Slope".[2][3][4] The neighborhood takes its name from its location on the western slope of neighboring Prospect Park. Fifth Avenue and Seventh Avenue are its primary commercial streets, while its east-west side streets are lined with brownstones and apartment buildings.[5]

Neighborhood:  Sheepshead Bay
Sheepshead Bay is a bay separating the mainland of Brooklyn, New York City, from the eastern portion of Coney Island, the latter originally a barrier island but now effectively an extension of th

In [17]:
get_desc = lambda x: neighbors_desc[x]

In [31]:
df['description'] = df['NAME'].map(get_desc)

In [32]:
df.head(10)

Unnamed: 0,ZIP,BORO,NEIGHBORHOOD,NAME,WIKILINK,description,fulllink
0,10451,Bronx,High Bridge and Morrisania,Mott Haven,"Mott_Haven,_Bronx","718, 347, 646","https://en.wikipedia.org/wiki/Mott_Haven,_Bronx"
1,10452,Bronx,High Bridge and Morrisania,Highbridge,"Highbridge,_Bronx",Highbridge is a residential neighborhood geogr...,"https://en.wikipedia.org/wiki/Highbridge,_Bronx"
2,10453,Bronx,Central Bronx,Morris Heights,"Morris_Heights,_Bronx",Morris Heights is a residential neighborhood l...,"https://en.wikipedia.org/wiki/Morris_Heights,_..."
3,10454,Bronx,Hunts Point and Mott Haven,Mott Haven,"Mott_Haven,_Bronx","718, 347, 646","https://en.wikipedia.org/wiki/Mott_Haven,_Bronx"
4,10455,Bronx,Hunts Point and Mott Haven,Mott Haven,"Mott_Haven,_Bronx","718, 347, 646","https://en.wikipedia.org/wiki/Mott_Haven,_Bronx"
5,10456,Bronx,High Bridge and Morrisania,Morrisania,"Morrisania,_Bronx",Morrisania (/mɒrɪˈseɪniə/ morr-i-SAY-nee-ə) is...,"https://en.wikipedia.org/wiki/Morrisania,_Bronx"
6,10457,Bronx,Central Bronx,Tremont,"Tremont,_Bronx","Tremont, is a residential neighborhood in the ...","https://en.wikipedia.org/wiki/Tremont,_Bronx"
7,10458,Bronx,Bronx Park and Fordham,Fordham,"Fordham,_Bronx",Fordham is a group of neighborhoods located in...,"https://en.wikipedia.org/wiki/Fordham,_Bronx"
8,10459,Bronx,Hunts Point and Mott Haven,Hunts Point,"Hunts_Point,_Bronx",Not to be confused with Hunters Point in Queen...,"https://en.wikipedia.org/wiki/Hunts_Point,_Bronx"
9,10460,Bronx,Central Bronx,West Farms,"West_Farms,_Bronx",West Farms is a residential neighborhood in a ...,"https://en.wikipedia.org/wiki/West_Farms,_Bronx"


In [33]:
# Prepend the base URL to every WIKILINK value
df['fulllink'] = "https://en.wikipedia.org/wiki/" + df['WIKILINK']

In [34]:
df.tail(10)

Unnamed: 0,ZIP,BORO,NEIGHBORHOOD,NAME,WIKILINK,description,fulllink
165,11230,Brooklyn,Borough Park,Midwood,"Midwood,_Brooklyn",Midwood is a neighborhood in the south-central...,"https://en.wikipedia.org/wiki/Midwood,_Brooklyn"
166,11231,Brooklyn,Northwest Brooklyn,Red Hook,"Red_Hook,_Brooklyn",Red Hook is a neighborhood in the New York Cit...,"https://en.wikipedia.org/wiki/Red_Hook,_Brooklyn"
167,11232,Brooklyn,Sunset Park,Greenwood,"Sunset_Park,_Brooklyn",Sunset Park is a neighborhood in the southwest...,"https://en.wikipedia.org/wiki/Sunset_Park,_Bro..."
168,11233,Brooklyn,Central Brooklyn,Stuyvesant Heights,"Bedford_Stuyvesant,_Brooklyn",Bedford–Stuyvesant (/ˈbɛdfərdˈstaɪvəsənt/; col...,https://en.wikipedia.org/wiki/Bedford_Stuyvesa...
169,11234,Brooklyn,Canarsie and Flatlands,Mill Basin,"Mill_Basin,_Brooklyn",Mill Basin is a residential neighborhood in Ne...,"https://en.wikipedia.org/wiki/Mill_Basin,_Broo..."
170,11235,Brooklyn,Southern Brooklyn,Brighton Beach,Brighton_Beach,Brighton Beach is an oceanside neighborhood in...,https://en.wikipedia.org/wiki/Brighton_Beach
171,11236,Brooklyn,Canarsie and Flatlands,Canarsie,"Canarsie,_Brooklyn",Canarsie (/kəˈnɑːrsi/ kə-NAR-see) is a working...,"https://en.wikipedia.org/wiki/Canarsie,_Brooklyn"
172,11237,Brooklyn,Bushwick and Williamsburg,Bushwick,"Bushwick,_Brooklyn",Bushwick is a working-class neighborhood in th...,"https://en.wikipedia.org/wiki/Bushwick,_Brooklyn"
173,11238,Brooklyn,Central Brooklyn,Prospect Heights,"Prospect_Heights,_Brooklyn",Prospect Heights is a neighborhood in the nort...,https://en.wikipedia.org/wiki/Prospect_Heights...
174,11239,Brooklyn,Canarsie and Flatlands,Starrett City,"Starrett_City,_Brooklyn",Starrett City (informally and colloquially kno...,"https://en.wikipedia.org/wiki/Starrett_City,_B..."


In [35]:
df.to_csv("zipdescription.csv", index=False)

In [26]:
# Note: some scraped first paragraphs are not descriptions (e.g. Mott Haven returns "718, 347, 646"),
# so the results were spot-checked and corrected directly in the CSV.