With a large dataset of tweets and a wide range of possible cities, we wanted to confirm that the cities listed on people's accounts matched a list of all cities in the United States.

In [1]:
import pandas as pd
import numpy as np
pd.set_option('display.max_rows', 1000)

In [2]:
df = pd.read_csv('./combined_us.csv')
cities = pd.read_csv('./USA-cities-and-states/us_cities_states_counties.csv', sep = '|')

In [3]:
df

Unnamed: 0,status_id,user_id,created_at,screen_name,text,source,is_quote,is_retweet,favourites_count,retweet_count,...,place_full_name,place_type,followers_count,friends_count,account_created_at,verified,lang,is_reply,city,state
0,1255648063143710723,92677101,2020-04-30 00:00:01+00:00,lopezgovlaw,#SurveySunday: Most Miamians Think #Coronaviru...,Twitter for iPhone,False,False,8132,2,...,"Miami, FL",city,16007,3634,2009-11-26 03:42:10+00:00,True,en,False,Miami,FL
1,1255648072090169345,217987384,2020-04-30 00:00:03+00:00,maisondecorla,*Breaking News* This just in - Stonewall Kitch...,Instagram,False,False,67,0,...,"Boutte, LA",city,105,146,2010-11-21 03:03:54+00:00,False,en,False,Boutte,LA
2,1255648079790903297,62657173,2020-04-30 00:00:05+00:00,drosssports,"Once again, while it wasn’t @DrDavidKatz , @to...",Twitter for iPhone,True,False,38651,0,...,"Chicago, IL",city,6441,501,2009-08-03 23:44:39+00:00,True,en,False,Chicago,IL
3,1255648164809445382,988863378427990017,2020-04-30 00:00:25+00:00,prana_mani,"Free #webinar announcement! This Sunday, May 3...",Twitter for iPhone,False,False,159,1,...,"Troy, NY",city,23,93,2018-04-24 19:32:56+00:00,False,en,False,Troy,NY
4,1255648197835251713,475393715,2020-04-30 00:00:33+00:00,KPobgyndoc,Went on a #COVID19 walk with my #mentor on her...,Twitter for iPhone,False,False,6336,2,...,"Oakland, CA",city,7161,6670,2012-01-27 00:49:58+00:00,False,en,False,Oakland,CA
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
155374,1244413925912907778,160304538,2020-03-29 23:59:34+00:00,iskandrah,"I don't know who needs to hear this, but #coro...",Twitter for Android,False,False,171376,4,...,"Illinois, USA",admin,18163,7695,2010-06-27 20:18:34+00:00,True,en,False,unspecified,IL
155375,1244413952945250305,862769187499986944,2020-03-29 23:59:41+00:00,Customer_Effort,Trying my #comfilife #coccyx #cushion #sciatic...,Twitter for iPhone,False,False,95,0,...,"Minnesota, USA",admin,29,48,2017-05-11 20:39:38+00:00,False,en,False,unspecified,MN
155376,1244413955172372481,80139649,2020-03-29 23:59:41+00:00,rickeybevington,I am surprised too &gt; “Local leaders and som...,Twitter for iPad,True,False,15021,1,...,"Atlanta, GA",city,4199,1858,2009-10-05 22:12:28+00:00,False,en,False,Atlanta,GA
155377,1244414009736118277,32232449,2020-03-29 23:59:54+00:00,djfenixchicago,Another 30! #socialdistancing #coronavirus #co...,Twitter for Android,False,False,1077,0,...,"Chicago, IL",city,819,217,2009-04-17 01:00:08+00:00,False,en,False,Chicago,IL


In [4]:
cities

Unnamed: 0,City,State short,State full,County,City alias
0,Holtsville,NY,New York,SUFFOLK,Internal Revenue Service
1,Holtsville,NY,New York,SUFFOLK,Holtsville
2,Adjuntas,PR,Puerto Rico,ADJUNTAS,URB San Joaquin
3,Adjuntas,PR,Puerto Rico,ADJUNTAS,Jard De Adjuntas
4,Adjuntas,PR,Puerto Rico,ADJUNTAS,Colinas Del Gigante
...,...,...,...,...,...
63206,Klawock,AK,Alaska,PRINCE OF WALES HYDER,Klawock
63207,Metlakatla,AK,Alaska,PRINCE OF WALES HYDER,Metlakatla
63208,Point Baker,AK,Alaska,PRINCE OF WALES HYDER,Point Baker
63209,Ward Cove,AK,Alaska,KETCHIKAN GATEWAY,Ward Cove


The cities dataset lists every city in the United States along with its state, county, and many alternative city names. The dataset was put together by grammakov and can be found unaltered on their [GitHub](https://github.com/grammakov/USA-cities-and-states).

In [5]:
# Collapse the master list to its unique city names
city = list(set(cities['City']))

In [6]:
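# Assumption: this cell was meant to drop a missing-value (NaN) entry.
# Set ordering is arbitrary, so filter explicitly instead of deleting index 0.
city = [c for c in city if pd.notna(c)]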

In [9]:
# Collapse the city column of the tweet data to its unique names
df_c = list(set(df['city']))

Each column of city names was converted to a set so that overlapping names could be compared without spending time iterating over hundreds or thousands of duplicates.
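As a minimal sketch of this deduplication step, the snippet below reduces a column to its unique, non-missing names before any comparison. The `sample` frame and its values are made up for illustration and are not taken from the datasets above.

```python
import pandas as pd

# Hypothetical toy data standing in for the tweet table.
sample = pd.DataFrame(
    {"city": ["Miami", "Chicago", "Miami", None, "Chicago", "Troy"]}
)

# dropna() removes missing values; set() collapses duplicates.
unique_cities = set(sample["city"].dropna())

print(len(sample), "rows ->", len(unique_cities), "unique city names")
```

Dropping missing values before building the set avoids having to remove a NaN entry from the result afterwards.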

In [10]:
len(df_c)

5628

There were 5628 unique city names in our Twitter dataset.

In [11]:
# https://medium.com/better-programming/a-visual-guide-to-set-comparisons-in-python-6ab7edb9ec41
inter = set(df_c).intersection(set(city))

In [12]:
inter

{'Delray Beach',
 'Bath',
 'Lemoyne',
 'Redford',
 'Jupiter',
 'Chalmette',
 'Port Lavaca',
 'Altadena',
 'Mukilteo',
 'Mendota',
 'Litchfield Park',
 'Southbridge',
 'Thatcher',
 'Lake Elsinore',
 'Oxford',
 'Wellington',
 'Biddeford',
 'Shelbyville',
 'Port Chester',
 'Pepperell',
 'Lamar',
 'Windham',
 'Roanoke',
 'Herculaneum',
 'Wellesley',
 'Hemet',
 'Marietta',
 'Elizabeth',
 'Hagerstown',
 'Dryden',
 'South Elgin',
 'Sunland Park',
 'Westport',
 'Mountain Home',
 'Sunset Beach',
 'Redan',
 'Groveport',
 'Hurricane',
 'Ipswich',
 'Muldraugh',
 'North Palm Beach',
 'Selinsgrove',
 'Port Salerno',
 'South Sioux City',
 'Goshen',
 'Lynwood',
 'Harrisonburg',
 'Bayville',
 'Justice',
 'Syracuse',
 'Suffolk',
 'Fernandina Beach',
 'Chino',
 'Three Oaks',
 'Santa Cruz',
 'Addison',
 'Severn',
 'Pelham',
 'Pine Island',
 'Reisterstown',
 'Hanover',
 'Weslaco',
 'Crown Point',
 'Picayune',
 'Howard',
 'Pflugerville',
 'Swansea',
 'Wyncote',
 'Brackenridge',
 'Granger',
 'Galt',
 'San Le

In [13]:
len(inter)

4159

`inter` holds every city name that appears both in the locations listed on Twitter and in the master list of US cities. With only 4159 of the 5628 names overlapping, there is clearly some discrepancy between the names on Twitter and the master list.

In [14]:
no_intersection = list(set(df_c) - set(city))

In [15]:
len(no_intersection)

1469
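The intersection and the difference split the Twitter names into two non-overlapping groups, so their sizes should add up to the total (here, 4159 + 1469 = 5628). The sketch below shows the same bookkeeping on toy sets; `tweet_cities` and `master_cities` are hypothetical stand-ins, not the real data.

```python
# Hypothetical toy sets standing in for df_c and city.
tweet_cities = {"Miami", "Chicago", "Dix Hills", "McLean"}
master_cities = {"Miami", "Chicago", "Troy", "Oakland"}

# Names found in both sets.
matched = tweet_cities & master_cities

# Names found on the Twitter side only.
unmatched = tweet_cities - master_cities

# The two groups partition the Twitter names, so the sizes must add up.
sizes_add_up = len(matched) + len(unmatched) == len(tweet_cities)
print(len(matched), len(unmatched), sizes_add_up)
```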

In [16]:
no_intersection

['Progress',
 'Grosse Pointe Farms',
 'Surfside Beach',
 'Orland Hills',
 'Fairwood',
 'King of Prussia',
 'Middleborough Center',
 'Blasdell',
 'Westwood Shores',
 'Hide-A-Way Lake',
 'Belville',
 'Dix Hills',
 'Indian Shores',
 'Plandome',
 'North St Paul',
 'Scott Township',
 'Ingalls Park',
 'Shoreline',
 'Country Homes',
 'McKee City',
 'Buckhall',
 'Delhi Hills',
 'Madeira',
 'North Shore',
 'Grandview Heights',
 'Dargin',
 'Big Coppitt Key',
 'Fort Carson',
 'Orange Blossom Hills',
 'Lake Success',
 'Security',
 'McLean',
 'Blennerhassett',
 'Harahan',
 'Clyde Hill',
 'Green Park',
 'Neptune City',
 'Wagon Wheel',
 'Farmington Hills',
 'Bull Run',
 'Walker Mill',
 'Howland Center',
 'Avilla Beach',
 'Agua Fria',
 'Bells Cross Roads',
 'Blue Ash',
 'Ocean Pines',
 'Sunny Isles Beach',
 'Lutherville',
 'Oroville East',
 'Hoover',
 'Dale City',
 'Kendall West',
 'Dakota Ridge',
 'St Matthews',
 'Branchburg',
 'Pasatiempo',
 'Phalanx',
 'Lynnwood-Pricedale',
 'Picnic Point',
 'North

`no_intersection` holds every city listed as a location on Twitter that was not found in our master list of US cities. Looking through a number of these 'cities' revealed that many were either neighborhoods or census-designated places (CDPs). CDPs are unincorporated communities that may have a colloquial name but legally exist only for census collection purposes. They are not self-governing municipalities and thus would not appear on a list of US cities. Likewise, neighborhoods of major cities may serve as a common reference for where someone lives, but they rarely have their own administrations separate from the cities they are located within.
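One possible follow-up, which this notebook does not perform, would be to check the unmatched names against the `City alias` column of the master list. The sketch below assumes that approach and uses a hypothetical toy frame and a made-up `unmatched` list; it is not a claim about how many real names such a check would recover.

```python
import pandas as pd

# Hypothetical toy frame with the same column names as the master list.
master = pd.DataFrame(
    {
        "City": ["Holtsville", "Holtsville", "Adjuntas"],
        "City alias": ["Internal Revenue Service", "Holtsville", "Jard De Adjuntas"],
    }
)

# Hypothetical names that did not match the City column.
unmatched = ["Jard De Adjuntas", "Dix Hills"]

# Unique, non-missing alias names.
aliases = set(master["City alias"].dropna())

# Split the unmatched names by whether they appear as an alias.
found_as_alias = sorted(set(unmatched) & aliases)
still_unmatched = sorted(set(unmatched) - aliases)

print(found_as_alias, still_unmatched)
```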

In [17]:
check = set(city).intersection(no_intersection)

In [18]:
check

set()

The empty set confirms that none of the names in `no_intersection` appear in the master list of cities.
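The same sanity check can be written with `set.isdisjoint`, which returns a boolean instead of building the intersection. This is a minimal sketch on hypothetical toy values, not the notebook's data.

```python
# Hypothetical toy values standing in for city and no_intersection.
master_cities = {"Miami", "Chicago", "Troy"}
unmatched = ["Dix Hills", "McLean"]

# isdisjoint accepts any iterable and is True when no element is shared.
no_overlap = master_cities.isdisjoint(unmatched)
print(no_overlap)
```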