# sorting

This notebook sorts through the listings scraped from Craigslist and identifies those that mention parks in their description.

## step one: data description clean up
### listings
1. Load the listing data from the scraping notebook.
2. Remove extra spaces and symbols and set the text to all lowercase.
3. Delete duplicate listings.

In [298]:
import pandas as pd

listingsDf_trim = pd.read_pickle('combined_df.pkl')

In [299]:
import re

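def remove_and_clean(text):
    # Collapse newlines and repeated whitespace into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    # Lowercase so the later substring checks are case-insensitive.
    text = text.lower()
    return text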

listingsDf_trim['description'] = listingsDf_trim['description'].apply(remove_and_clean)

In [300]:
listingsDf_trim.drop_duplicates(subset='description', inplace=True)

## step two: start sorting

We want to sort the listings to find instances where parks are mentioned. To do so we:
>- load a parks file from the county data hub with the names and locations of all the parks in the county.
>- create a list of neighborhoods in Los Angeles that are named after parks to exclude from our "parks_True" category.
>- create a list of words that allude to parks to include in our "parks_True" category.
>- make a function that reads through all the descriptions and identifies those that mention a park by putting a "True" value in a new column, park_TF.
>- complement the above function with another that identifies what in the description flagged it as mentioning a park. This was mainly used to check our work.
>- check what we are capturing by counting the number of listings that mention parks, using value counts.

In [301]:
import geopandas as gpd

parks = gpd.read_file('county_parks.geojson')
parks.sample(2)

Unnamed: 0,OBJECTID,UNIT_ID,LMS_ID,PARK_NAME,PARK_LBL,ACCESS_TYP,RPT_ACRES,GIS_ACRES,AGNCY_NAME,AGNCY_LEV,...,CENTER_LON,ADDRESS,CITY,ZIP,HOURS,PHONES,IS_COUNTY,Shape__Area,Shape__Length,geometry
118,119,2996.0,23619.0,Kuns Park,Kuns Park,Open Access,2.362,2.362307,"La Verne, City of",City,...,-117.778044,1600 Bonita Ave,La Verne,91750,,,No,102901.676758,1285.292634,"POLYGON ((-117.77736 34.10445, -117.77766 34.10367, -117.77875 34.10395, -117.77846 34.10473, -117.77841 34.10472, -117.77736 34.10445))"
1201,1202,7281.0,,Pamplico Park,Pamplico Park,Open Access,7.583,7.582048,"Santa Clarita, City of",City,...,-118.527578,22444 Pamplico Dr,Santa Clarita,91350,,,No,330272.678711,2349.981665,"POLYGON ((-118.52680 34.45165, -118.52676 34.45165, -118.52670 34.45165, -118.52666 34.45165, -118.52659 34.45165, -118.52659 34.45164, -118.52660 34.45123, -118.52660 34.45120, -118.52660 34.45118, -118.52660 34.45115, -118.52660 34.45113, -118.52661 34.45111, -118.52661 34.45109, -118.52660 34.45106, -118.52662 34.45104, -118.52661 34.45102, -118.52663 34.45100, -118.52662 34.45097, -118.52664 34.45095, -118.52663 34.45092, -118.52665 34.45091, -118.52666 34.45089, -118.52666 34.45086, -118.52667 34.45084, -118.52668 34.45083, -118.52669 34.45080, -118.52670 34.45078, -118.52671 34.45076, -118.52672 34.45073, -118.52673 34.45072, -118.52674 34.45070, -118.52675 34.45067, -118.52676 34.45065, -118.52677 34.45063, -118.52678 34.45062, -118.52680 34.45060, -118.52752 34.44954, -118.52790 34.44970, -118.52812 34.44980, -118.52850 34.44995, -118.52871 34.45004, -118.52881 34.45009, -118.52771 34.45187, -118.52766 34.45185, -118.52762 34.45183, -118.52758 34.45181, -118.52752 34.45181, -118.52748 34.45179, -118.52743 34.45177, -118.52740 34.45176, -118.52739 34.45176, -118.52734 34.45175, -118.52729 34.45174, -118.52728 34.45174, -118.52724 34.45173, -118.52719 34.45171, -118.52715 34.45170, -118.52710 34.45169, -118.52705 34.45169, -118.52701 34.45168, -118.52695 34.45167, -118.52690 34.45167, -118.52686 34.45166, -118.52680 34.45165))"


There are a bunch of neighborhoods (and transit stations) in Los Angeles that are named after a park, which makes this tricky. We wanted to take out these listings, along with any terms luxury developers use to refer to amenities that aren't public (for example, "rooftop park"). Identifying these terms took a lot of trial and error: we read through sample posts and looked for trends.

In [302]:
excluded = ['highland park',
            'hancock park',
            'echo park and silverlake',
            'echo park',
            'silverlake and echo park',
            'silverlake, echo park',
            'echo park, silverlake',
            'rancho park',
            'macarthur park station',
            'south park',
            'rooftop park',
            'hollywood bowl']

Some listings allude to a park, or to parks in general, without naming a specific one. We wanted to include those listings in our analysis, so we created a list of terms that allude to parks called "park_terms". This is also a little tricky because you want to avoid counting anything that refers to parking.

In [303]:
park_terms = ['hike',
              'hiking',
              'trail',
              'trails',
              'community park',
              'community parks',
              'local park',
              'local parks',
              'parks nearby',
              'public parks',
              'public park',
              'echo park lake',
              'park access',
              'parks',
              'recreational parks',
              'a park nearby',
              'the park',
              'park nearby',
              'griffith park',
              'exposition park',
              'recreation center',
              'regional park',
              ]
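
Plain substring checks still let some parking phrases through: 'the park' is a substring of 'the parking lot', for example. One possible guard, sketched here and not part of the original workflow, is to compile the terms into a single regular expression with word boundaries. The names `sample_terms`, `term_pattern`, and `mentions_term` are hypothetical and only illustrate the idea on a small subset of terms.

```python
import re

# Hypothetical subset of park_terms, used only for illustration.
sample_terms = ['the park', 'local park', 'parks', 'trail']

# \b anchors each term at word boundaries, so 'the park' no longer
# matches inside 'the parking'.
term_pattern = re.compile(
    r'\b(?:' + '|'.join(re.escape(t) for t in sample_terms) + r')\b'
)

def mentions_term(description):
    """Return True if any term appears as a whole word or phrase."""
    return bool(term_pattern.search(description))

print(mentions_term('steps from the park and a hiking trail'))
print(mentions_term('one spot in the parking garage'))
```

With whole-word matching, singular and plural forms have to be listed separately, which the list above already does for pairs like 'trail' and 'trails'.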

This function reads each description and marks the listing "True" if it contains a park name from the parks data file, unless it also contains one of the excluded terms for neighborhoods named after parks (e.g. highland park). It then also marks as "True" every post that contains one of the park terms.

In [None]:
park_names = parks['PARK_NAME'].dropna().str.lower().tolist()  # lowercase once; skip missing names
listingsDf_trim['park_TF'] = listingsDf_trim['description'].apply(
    lambda x: (any(name in x for name in park_names) and not any(term in x for term in excluded))
    or any(term in x for term in park_terms))
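
To see how the three conditions interact, here is a small self-contained sketch on made-up descriptions. The lists and listings below are invented stand-ins for parks['PARK_NAME'], excluded, and park_terms (treating 'echo park' as a park name is an assumption for illustration), and `flag_park` is a hypothetical helper rather than code from this notebook.

```python
# Hypothetical stand-ins for parks['PARK_NAME'], excluded, and park_terms.
toy_park_names = ['kuns park', 'echo park']
toy_excluded = ['echo park', 'rooftop park']
toy_park_terms = ['hiking', 'echo park lake']

def flag_park(description):
    """(park name present AND no excluded term) OR any park term present."""
    has_name = any(name in description for name in toy_park_names)
    has_excluded = any(term in description for term in toy_excluded)
    has_term = any(term in description for term in toy_park_terms)
    return (has_name and not has_excluded) or has_term

toy_descriptions = [
    'two blocks from kuns park',   # named park
    'cozy studio in echo park',    # neighborhood name only
    'walk to echo park lake',      # excluded name, rescued by a park term
    'covered parking included',    # no match
]
toy_flags = [flag_park(d) for d in toy_descriptions]
print(toy_flags)  # [True, False, True, False]
```

The third case shows why the park terms are checked last: a term such as 'echo park lake' can override an excluded neighborhood name.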

We also want to know why the function marked a listing as true, so we added another column called park_T_why, which lists the terms that flagged the description as mentioning a park.

## step three: look at what we found

>- complement the above function with another that identifies what in the description flagged it as mentioning a park. This was mainly used to check our work.
>- check what we are capturing by counting the number of listings that mention parks, using value counts.
>- look at a sample of the listings that were flagged as true and make sure they make sense.
>- map the listings to see what the results look like spatially.
>- export the sorted listings for further analysis.

In [None]:
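park_names = parks['PARK_NAME'].dropna().str.lower().tolist()  # recomputed so this cell runs on its own
# Mirror park_TF: park names only count when no excluded term is in the description;
# otherwise fall back to the park terms.
listingsDf_trim['park_T_why'] = listingsDf_trim['description'].apply(
    lambda x: ([] if any(term in x for term in excluded) else [name for name in park_names if name in x])
    or [term for term in park_terms if term in x]
)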
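
Because the flag and its explanation are computed in two separate cells, they can drift apart if one is edited without the other. A possible alternative, shown here as a minimal sketch with hypothetical names (`park_reasons` and the toy lists), is to compute the reasons once and derive the flag from them. Unlike the cell above, this version reports matched park names and matched park terms together.

```python
# Hypothetical stand-ins for the notebook's lists.
toy_park_names = ['kuns park', 'echo park']
toy_excluded = ['echo park']
toy_park_terms = ['hiking', 'trail']

def park_reasons(description):
    """Matched park names (only if no excluded term is present) plus matched park terms."""
    names = []
    if not any(term in description for term in toy_excluded):
        names = [name for name in toy_park_names if name in description]
    terms = [term for term in toy_park_terms if term in description]
    return names + terms

reasons = park_reasons('near kuns park with hiking close by')
flag = bool(reasons)  # True exactly when something matched
print(reasons, flag)
```

In the notebook this would amount to applying one function to the description column and setting park_TF from whether the resulting list is empty.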

In [None]:
listingsDf_trim.park_TF.value_counts()

In [None]:
# Keep only the listings flagged as mentioning a park.
listingsDf_trim_parks = listingsDf_trim[listingsDf_trim['park_TF']]
pd.set_option('display.max_colwidth', None)  # show full descriptions
listingsDf_trim_parks[['description', 'park_T_why']].head()

In [None]:
listingsDf_trim = listingsDf_trim[listingsDf_trim['lat'] != 'NA']
gdf = gpd.GeoDataFrame(listingsDf_trim, geometry=gpd.points_from_xy(listingsDf_trim.long, listingsDf_trim.lat), crs="EPSG:4326")

In [None]:
# import libraries
import plotly.express as px


# create scatter map
fig = px.scatter_mapbox(gdf, 
                        lat=gdf.geometry.y, 
                        lon=gdf.geometry.x, 
                        color="park_TF",
                        mapbox_style="carto-positron",
                        zoom=9,
                        center = {"lat": 34, "lon": -118.4},
                        opacity=.8,
                        color_discrete_sequence=px.colors.qualitative.Vivid,
                        hover_data={"park_T_why": True}
                       )
                               

# options on the layout
fig.update_layout(
        width = 900,
        height = 700,
        title = "Listings",
        title_x = .5
    )

fig.update_traces(
    hovertemplate="Park mentioned: %{customdata[0]}<extra></extra>"
)

fig.update_layout(legend_title_text="Park Mentioned")

fig.show()

In [None]:
listingsDf_trim.to_pickle("./listings_sorted.pkl")