# Part 1: Data Source
### Summary of datasets:
| Name | URL | Source |
|------|-----|--------|
| Taxi_Zones | https://data.cityofnewyork.us/Transportation/NYC-Taxi-Zones/d3c5-ddgc | NYC Open Data |
| Property Valuation and Assessment Data | https://data.cityofnewyork.us/City-Government/Property-Valuation-and-Assessment-Data/yjxr-fw8i | NYC Open Data |
| 2015-2020 Yellow Taxi Trip Data | https://data.cityofnewyork.us/browse?q=taxi&sortBy=relevance | NYC Open Data |
| School Dataset - USA Public Schools | https://www.kaggle.com/datasets/carlosaguayo/usa-public-schools | Kaggle |
| Hospitals Dataset - USA Hospitals | https://www.kaggle.com/datasets/carlosaguayo/usa-hospitals | Kaggle |


In [1]:
import pandas as pd
import csv
import mysql.connector
from shapely.geometry import Point, Polygon
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm

# Part 2: Data pre-processing

### 2.1 Taxi Zones Data:

In [2]:
df = pd.read_csv("dataSource/taxi_zones.csv")

In [3]:
df.head(5)

Unnamed: 0,OBJECTID,Shape_Leng,the_geom,Shape_Area,zone,LocationID,borough
0,1,0.116357,MULTIPOLYGON (((-74.18445299999996 40.69499599...,0.000782,Newark Airport,1,EWR
1,2,0.43347,MULTIPOLYGON (((-73.82337597260663 40.63898704...,0.004866,Jamaica Bay,2,Queens
2,3,0.084341,MULTIPOLYGON (((-73.84792614099985 40.87134223...,0.000314,Allerton/Pelham Gardens,3,Bronx
3,4,0.043567,MULTIPOLYGON (((-73.97177410965318 40.72582128...,0.000112,Alphabet City,4,Manhattan
4,5,0.092146,MULTIPOLYGON (((-74.17421738099989 40.56256808...,0.000498,Arden Heights,5,Staten Island


In [4]:
df.shape

(263, 7)

In [5]:
df.dtypes

OBJECTID        int64
Shape_Leng    float64
the_geom       object
Shape_Area    float64
zone           object
LocationID      int64
borough        object
dtype: object

- Keep only the zones located in Manhattan.

In [6]:
filtered_df = df[df['borough'] == 'Manhattan']
filtered_df.to_csv('programing data/taxi_zones.csv', index=False)

- Saving the data to the database

In [None]:
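# Create MySQL database connection
# The password is read from an environment variable instead of being hard-coded
# (the variable name MYSQL_PASSWORD is an assumption)
import os
conn = mysql.connector.connect(host='localhost', user='root', password=os.environ['MYSQL_PASSWORD'], database='Investment')
cursor = conn.cursor()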

create_table_sql = """
CREATE TABLE Zones (
    OBJECTID INT,
    Shape_Leng FLOAT,
    The_geom GEOMETRY,
    Shape_Area FLOAT,
    Zone VARCHAR(255),
    LocationID INT,
    borough VARCHAR(255)
)
"""

cursor.execute(create_table_sql)

with open('programing data/taxi_zones.csv', 'r') as file:
    reader = csv.reader(file)
    next(reader)  # Skip header row
    for row in reader:
        object_id = int(row[0])
        shape_length = float(row[1])
        geom = row[2]  # The MULTIPOLYGON data as a string representation
        shape_area = float(row[3])
        zone = row[4]
        location_id = int(row[5])
        borough = row[6]

        # Insert data into the table
        insert_sql = "INSERT INTO Zones (OBJECTID, Shape_Leng, the_geom, Shape_Area, zone, LocationID, borough) VALUES (%s, %s, ST_GeomFromText(%s), %s, %s, %s, %s)"
        cursor.execute(insert_sql, (object_id, shape_length, geom, shape_area, zone, location_id, borough))
        conn.commit()

conn.close()
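
Committing after every row becomes slow on larger files. A minimal sketch of a batched variant, assuming the same `Zones` table and CSV column order as above: the rows are collected first, passed to the DB-API `executemany` method, and committed once. The helper names `read_zone_rows` and `insert_zones` are hypothetical and not part of the original notebook.

```python
import csv
import io

INSERT_SQL = (
    "INSERT INTO Zones (OBJECTID, Shape_Leng, the_geom, Shape_Area, zone, LocationID, borough) "
    "VALUES (%s, %s, ST_GeomFromText(%s), %s, %s, %s, %s)"
)

def read_zone_rows(file_obj):
    """Yield one parameter tuple per CSV row, skipping the header."""
    reader = csv.reader(file_obj)
    next(reader)  # skip header row
    for row in reader:
        yield (int(row[0]), float(row[1]), row[2], float(row[3]),
               row[4], int(row[5]), row[6])

def insert_zones(conn, file_obj):
    """Insert all rows in one batch and commit once (hypothetical helper)."""
    rows = list(read_zone_rows(file_obj))
    cursor = conn.cursor()
    cursor.executemany(INSERT_SQL, rows)
    conn.commit()
    cursor.close()
    return len(rows)
```

With a real connection this would be called as `insert_zones(conn, open('programing data/taxi_zones.csv', newline=''))`.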

### 2.2 Property Valuation and Assessment Data:

#### Explain the meaning of each feature

- BBLE: The New York City property identifier, formed from the borough, block, lot, and easement codes.
- BORO: Borough code (1 = Manhattan).
- BLOCK: Block number, an integer identifying the block in which the property is located.
- LOT: Lot number, an integer identifying the lot in which the property is located.
- EASEMENT: Easement code, denoting a right to use, or a restriction on, land that the holder does not own.
- OWNER: Building owner.
- BLDGCL: Building class code, describing the class or use of the building.
- TAXCLASS: Tax class code, a string identifying the tax classification of the property.
- LTFRONT: The width of the front of the lot, expressed as an integer value in feet.
- LTDEPTH: The depth of the lot, expressed as an integer value in feet.
- EXT: Extension information, which may describe additional features of the building or land.
- STORIES: The number of floors of the building.
- FULLVAL: The full value of the property, expressed as an integer value in US dollars.
- AVLAND: Land value, expressed as an integer value in US dollars.
- AVTOT: Total value, expressed as an integer value in US dollars.
- EXLAND: Tax-exempt land value, expressed as an integer value in US dollars.
- EXTOT: Total tax-exempt value, expressed as an integer value in US dollars.
- EXCD1: Tax exemption class code (first assessment), a string identifying the tax exemption class.
- STADDR: Street address, a string giving the street address of the property.
- POSTCODE: Postal code of the location of the property.
- EXMPTCL: Tax exemption classification code, identifying the tax exemption category to which the property belongs.
- BLDFRONT: The width of the front of the building, expressed as an integer value in feet.
- BLDDEPTH: The depth of the building, expressed as an integer value in feet.
- AVLAND2: Land value (second assessment), expressed as a floating point value in US dollars.
- AVTOT2: Total value (second assessment), expressed as a floating point value in US dollars.
- EXLAND2: Tax-exempt land value (second assessment), expressed as a floating point value in US dollars.
- EXTOT2: Total tax-exempt value (second assessment), expressed as a floating point value in US dollars.
- EXCD2: Tax exemption class code (second assessment), a string identifying the tax exemption class.
- PERIOD: A string giving the time period in which the data was recorded.
- YEAR: A string giving the year of the record.
- VALTYPE: A string giving the value type of the record.
- Borough: A string naming the administrative division in which the property is located.
- Latitude: The latitude of the property, expressed as a floating point value.
- Longitude: The longitude of the property, expressed as a floating point value.
- Community Board: The community board district in which the property is located.
- Council District: The City Council district in which the property is located.
- Census Tract: The census tract in which the property is located.
- BIN: Building Identification Number. Like BBLE, it serves as a unique identifier; it is stored as a floating point value.
- NTA: Neighborhood Tabulation Area, a string naming the neighborhood.
- New Georeferenced Column: A string column holding the georeferenced point of the property.

In [7]:
df = pd.read_csv('dataSource/Property_Valuation_and_Assessment_Data.csv')

In [8]:
df.head(5)

Unnamed: 0,BBLE,BORO,BLOCK,LOT,EASEMENT,OWNER,BLDGCL,TAXCLASS,LTFRONT,LTDEPTH,...,VALTYPE,Borough,Latitude,Longitude,Community Board,Council District,Census Tract,BIN,NTA,New Georeferenced Column
0,1000163859,1,16,3859,,"CHEN, QI TOM",R4,2,0,0,...,AC-TR,,,,,,,,,
1,1000730028,1,73,28,,NYC DSBS,V1,4,183,52,...,AC-TR,,,,,,,,,
2,1000730029,1,73,29,,NYC DSBS,Y7,4,90,500,...,AC-TR,,,,,,,,,
3,1000297504,1,29,7504,,,R0,2,36,73,...,AC-TR,,,,,,,,,
4,1000360012,1,36,12,,NYC DSBS,Y7,4,534,604,...,AC-TR,,,,,,,,,


- Check how many rows and columns the dataset has.

In [9]:
df.shape

(9845857, 40)

- Check if there are duplicate rows or columns

In [10]:
#remove whitespace in or around feature names
df.columns = df.columns.str.replace(' ', '')

#check to ensure whitespaces have been removed
df.columns

Index(['BBLE', 'BORO', 'BLOCK', 'LOT', 'EASEMENT', 'OWNER', 'BLDGCL',
       'TAXCLASS', 'LTFRONT', 'LTDEPTH', 'EXT', 'STORIES', 'FULLVAL', 'AVLAND',
       'AVTOT', 'EXLAND', 'EXTOT', 'EXCD1', 'STADDR', 'POSTCODE', 'EXMPTCL',
       'BLDFRONT', 'BLDDEPTH', 'AVLAND2', 'AVTOT2', 'EXLAND2', 'EXTOT2',
       'EXCD2', 'PERIOD', 'YEAR', 'VALTYPE', 'Borough', 'Latitude',
       'Longitude', 'CommunityBoard', 'CouncilDistrict', 'CensusTract', 'BIN',
       'NTA', 'NewGeoreferencedColumn'],
      dtype='object')

In [11]:
#check for duplicate rows

#Print the number of duplicates, without the original rows that were duplicated
print('Number of duplicate (excluding first) rows in the table is: ', df.duplicated().sum())

# Use "keep=False" to mark all duplicates as true, including the original rows that were duplicated.
print('Number of duplicate rows (including first) in the table is:', df[df.duplicated(keep=False)].shape[0])

Number of duplicate (excluding first) rows in the table is:  0
Number of duplicate rows (including first) in the table is: 0
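
The difference between the two counts is easier to see on a toy frame. A small sketch with invented data: the default `duplicated()` flags only the repeats after the first occurrence, while `keep=False` flags every member of a duplicated group.

```python
import pandas as pd

# Toy data: the row (1, "x") appears three times
toy = pd.DataFrame({"a": [1, 1, 2, 1], "b": ["x", "x", "y", "x"]})

# Default keep='first': only the repeats after the first occurrence are flagged
extra_copies = int(toy.duplicated().sum())

# keep=False: every member of a duplicated group is flagged, including the first
all_members = int(toy.duplicated(keep=False).sum())

print(extra_copies, all_members)
```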


- Check if there is a constant column

In [12]:
#Check the data of category type to see if there is a constant column
df_columns = df.columns
features_card = list(df[df_columns].columns.values)

print('{0:35}  {1}'.format("Feature", "Unique Values"))
print('{0:35}  {1}'.format("-------", "--------------- \n"))

for c in df_columns:
    print('{0:35}  {1}'.format(c, str(len(df[c].unique()))))

Feature                              Unique Values
-------                              --------------- 

BBLE                                 1128885
BORO                                 5
BLOCK                                13985
LOT                                  6548
EASEMENT                             15
OWNER                                1470317
BLDGCL                               218
TAXCLASS                             11
LTFRONT                              1328
LTDEPTH                              1391
EXT                                  4
STORIES                              129
FULLVAL                              579432
AVLAND                               171876
AVTOT                                395031
EXLAND                               83255
EXTOT                                243174
EXCD1                                146
STADDR                               861582
POSTCODE                             239
EXMPTCL                              15
BLDFRONT  

The above result shows that PERIOD, VALTYPE are constant columns, so delete these two columns

In [13]:
columns_to_drop = ["PERIOD", "VALTYPE"]
df = df.drop(columns_to_drop, axis=1)
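
Instead of reading the constant columns off the printed table, they can be collected programmatically. A minimal sketch on invented toy data; the helper name `constant_columns` and the sample values are assumptions, not taken from the dataset.

```python
import pandas as pd

def constant_columns(frame):
    """Return names of columns holding a single distinct value (NaN counts as a value)."""
    counts = frame.nunique(dropna=False)
    return counts[counts <= 1].index.tolist()

# Toy frame: PERIOD and VALTYPE never change, YEAR does
toy = pd.DataFrame({
    "PERIOD": ["FINAL", "FINAL", "FINAL"],
    "YEAR": ["2016/17", "2017/18", "2018/19"],
    "VALTYPE": ["AC-TR", "AC-TR", "AC-TR"],
})
to_drop = constant_columns(toy)
toy = toy.drop(columns=to_drop)
```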

- Check %Missing column and %null column

In [14]:
# Percentage of 'Missing' placeholders, nulls, and zeros per column
n_rows = df.shape[0]
categorical_missing = {'Feature':[], 'Missing%':[], 'Null%':[], 'Total%':[], '0%':[]}
for column in df.columns:
    missing_pct = 100 * (df[column] == 'Missing').sum() / n_rows
    null_pct = 100 * df[column].isnull().sum() / n_rows
    categorical_missing['Feature'].append(column)
    categorical_missing['Missing%'].append(missing_pct)
    categorical_missing['Null%'].append(null_pct)
    categorical_missing['Total%'].append(missing_pct + null_pct)
    categorical_missing['0%'].append(100 * (df[column] == 0).sum() / n_rows)
pd.DataFrame(categorical_missing)

Unnamed: 0,Feature,Missing%,Null%,Total%,0%
0,BBLE,0.0,0.0,0.0,0.0
1,BORO,0.0,0.0,0.0,0.0
2,BLOCK,0.0,0.0,0.0,0.0
3,LOT,0.0,0.0,0.0,0.0
4,EASEMENT,0.0,99.580453,99.580453,0.0
5,OWNER,0.0,2.190576,2.190576,0.0
6,BLDGCL,0.0,0.0,0.0,0.0
7,TAXCLASS,0.0,0.0,0.0,0.0
8,LTFRONT,0.0,0.0,0.0,16.022221
9,LTDEPTH,0.0,0.0,0.0,16.84412


###### BORO

In [15]:
df['BORO'].unique()

array([1, 2, 3, 4, 5])

- BORO is the borough code of the property; it identifies the five boroughs of New York City.
- 1, Manhattan; 2, Bronx; 3, Brooklyn; 4, Queens; 5, Staten Island.
- Since this project focuses on Manhattan, only the rows with BORO value 1 are kept.

In [16]:
df = df[df['BORO'] == 1].copy()  # copy so that later in-place drops act on an independent frame

In [17]:
df.drop(labels=['BORO'],axis=1,inplace=True)

###### EASEMENT

In [18]:
df['EASEMENT'].unique()

array([nan, 'E', 'G', 'F', 'A', 'H', 'I', 'N', 'K'], dtype=object)

- In the EASEMENT column, these values indicate the easement status of the property. An easement is a specific right or restriction on the use of land without ownership. These values represent the various easement types that may exist in the area where the property is located. NaN indicates missing values, that is, no easement information is available.
- Since over 99.5% of the rows have no easement value, this attribute has almost no impact on the analysis, so it is removed.

In [19]:
df.drop(labels=['EASEMENT'],axis=1,inplace=True)

###### OWNER

- The owner does not affect the value of the building, and removing the name protects personal privacy, so this column is removed.

In [20]:
df.drop(labels=['OWNER'],axis=1,inplace=True)

###### EXMPTCL

In [21]:
df['EXMPTCL'].unique()

array([nan, 'X1', 'X4', 'X8', 'X6', 'X5', 'X2', 'VI', 'X7', 'X3', 'X9',
       'KI'], dtype=object)

- In the EXMPTCL column, these values indicate the tax exemption category to which the property belongs. NaN values are assumed to mean no tax-exempt status.
- Since over 98.5% of the values are missing, this attribute has almost no impact on the analysis, so it is removed.

In [22]:
df.drop(labels=['EXMPTCL'],axis=1,inplace=True)

###### AVLAND, AVLAND2 and AVTOT, AVTOT2

- AVLAND and AVTOT come from the first assessment; AVLAND2 and AVTOT2 come from the second. A missing second assessment is assumed to mean that it equals the first, so missing second-assessment values are filled with the first-assessment values.

In [23]:
df.loc[df['AVLAND2'].isnull(), 'AVLAND2'] = df['AVLAND']
df.loc[df['AVTOT2'].isnull(), 'AVTOT2'] = df['AVTOT']

- If AVLAND, AVLAND2, AVTOT, AVTOT2 and FULLVAL are all zero, the value of the property cannot be determined, so such rows are removed.

In [24]:
columns = ['FULLVAL', 'AVLAND', 'AVTOT', 'AVLAND2', 'AVTOT2']
df = df[~(df[columns] == 0).all(axis=1)]
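
The two steps above (fill the second assessment, then drop rows without any valuation) can be sketched on a toy frame. The numbers are invented for illustration; `Series.fillna` with another Series fills row by row.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "FULLVAL": [500000, 0, 750000],
    "AVLAND":  [100000, 0, 200000],
    "AVTOT":   [225000, 0, 337500],
    "AVLAND2": [np.nan, np.nan, 210000.0],
    "AVTOT2":  [np.nan, np.nan, 340000.0],
})

# Fill a missing second assessment with the first assessment of the same row
toy["AVLAND2"] = toy["AVLAND2"].fillna(toy["AVLAND"])
toy["AVTOT2"] = toy["AVTOT2"].fillna(toy["AVTOT"])

# Drop rows in which every valuation column is zero
value_cols = ["FULLVAL", "AVLAND", "AVTOT", "AVLAND2", "AVTOT2"]
toy = toy[~(toy[value_cols] == 0).all(axis=1)]
```

The middle row is dropped only after the fill, because its second-assessment values were NaN (not zero) before.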

###### EXLAND, EXLAND2 and EXTOT, EXTOT2

- For the same reason as above, missing EXLAND2 and EXTOT2 values are filled with EXLAND and EXTOT.

In [25]:
df.loc[df['EXLAND2'].isnull(), 'EXLAND2'] = df['EXLAND']
df.loc[df['EXTOT2'].isnull(), 'EXTOT2'] = df['EXTOT']

###### Borough

In [26]:
df['Borough'].unique()

array([nan, 'MANHATTAN'], dtype=object)

- Borough is a string naming the administrative division in which the property is located.
- The data was already restricted to Manhattan using BORO, and Borough contains only MANHATTAN and NaN, so the column carries no information and is removed.

In [27]:
df.drop(labels=['Borough'],axis=1,inplace=True)

###### NTA

- Because both columns describe location, the distinct (BLOCK, NTA) pairs are extracted to check whether each BLOCK corresponds to a single NTA.

In [28]:
unique_combinations = df[['BLOCK', 'NTA']].drop_duplicates()
unique_combinations.to_csv('programing data/unique_combinations.csv', index=False)

In [29]:
pd.DataFrame(unique_combinations)

Unnamed: 0,BLOCK,NTA
0,16,
1,73,
4,36,
11,209,
18,274,
...,...,...
8911181,2038,
8912822,2031,
8916755,2028,
8918224,2025,


- Three cases are found: a BLOCK number corresponds to a single NTA value; a BLOCK number corresponds to two values (a null value and one NTA name); or a BLOCK number corresponds to three values (a null value and two different NTA names).

In [30]:
block_counts = unique_combinations['BLOCK'].value_counts()
block_once = block_counts[block_counts == 1].index
block_twice = block_counts[block_counts == 2].index
block_thrice = block_counts[block_counts == 3].index

In [31]:
print("The total number of BLOCK values that appear once:", len(block_once))
print("The total number of BLOCK values that appear twice:", len(block_twice))
print("The total number of BLOCK values that appear three times:", len(block_thrice))

The total number of BLOCK values that appear once: 1367
The total number of BLOCK values that appear twice: 553
The total number of BLOCK values that appear three times: 34


- Fill each missing NTA value with the first non-null NTA value found for the same BLOCK.

In [32]:
block_nta_mapping = df.groupby('BLOCK')['NTA'].first().to_dict()
df['NTA'] = df['NTA'].fillna(df['BLOCK'].map(block_nta_mapping))
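
The mapping step can be checked on a toy frame. A small sketch with placeholder NTA names ("NTA A", "NTA B" are invented): `GroupBy.first()` skips nulls, so each BLOCK maps to its first non-null NTA, and a BLOCK without any NTA stays empty.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "BLOCK": [16, 16, 73, 73, 209],
    "NTA":   [np.nan, "NTA A", "NTA B", np.nan, np.nan],
})

# First non-null NTA per BLOCK (null if the BLOCK has no NTA at all)
block_to_nta = toy.groupby("BLOCK")["NTA"].first()

# Fill only the missing cells, looking up the value by BLOCK
toy["NTA"] = toy["NTA"].fillna(toy["BLOCK"].map(block_to_nta))
```

BLOCK 209 has no known NTA in this toy data, so its row remains null and needs a separate strategy.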

- Re-save and observe the data group of [BLOCK-NTA]

In [33]:
unique_combinations = df[['BLOCK', 'NTA']].drop_duplicates()
unique_combinations.to_csv('programing data/unique_combinations.csv')

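- The remaining missing values are filled from a neighbouring BLOCK: after sorting by BLOCK in descending order, each gap is back-filled from the next row that has an NTA value.

In [34]:
df = df.sort_values('BLOCK', ascending=False)
df['NTA'] = df['NTA'].bfill()  # fillna(method='bfill') is deprecated in recent pandas versions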
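
A toy example shows what the sort plus back-fill actually does. The data is invented; the point is that the fill always looks toward the next row, which after a descending sort is the next lower BLOCK number that has a value.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "BLOCK": [10, 11, 12, 13],
    "NTA":   ["NTA A", np.nan, np.nan, "NTA B"],
})

# Sort by BLOCK descending, then back-fill: each gap takes the value of the
# next row, i.e. the nearest lower BLOCK number that has an NTA.
toy = toy.sort_values("BLOCK", ascending=False)
toy["NTA"] = toy["NTA"].bfill()

result = toy.set_index("BLOCK")["NTA"].to_dict()
```

Here BLOCK 12 receives the value of BLOCK 10 even though BLOCK 13 is numerically closer, so the method approximates "closest BLOCK" in one direction only. Gaps at the very end of the sorted frame (the lowest BLOCK numbers) would stay empty.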

- The LocationID is obtained according to the NTA

In [35]:
taxi_zone = pd.read_csv('programing data/taxi_zones.csv', usecols=['zone','LocationID'])
merged_df = pd.merge(df, taxi_zone, left_on='NTA', right_on = 'zone', how='inner')

In [36]:
merged_df['LocationID'].unique()

array([243, 244, 116, 152, 166,  74,  75, 148, 107, 249,  79,  45])

- Manhattan has 69 neighborhoods, but matching NTA names against taxi zone names yields only 12 distinct LocationIDs. NTA therefore cannot be used to derive the LocationID, so the column is removed.

In [37]:
df.drop(labels=['NTA'],axis=1,inplace=True)
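
An inner merge silently drops every row whose NTA name has no identically spelled taxi zone. One way to see which names fail to match is a left merge with `indicator=True`; the sketch below uses a few illustrative rows, not the real data.

```python
import pandas as pd

# Illustrative toy rows: one NTA name has no identically named taxi zone
properties = pd.DataFrame({"NTA": ["East Harlem North", "Midtown-Midtown South", "Chinatown"]})
zones = pd.DataFrame({
    "zone": ["East Harlem North", "Chinatown", "Midtown Center"],
    "LocationID": [74, 45, 161],
})

# A left merge with indicator=True keeps unmatched rows and labels them in
# the _merge column, unlike an inner merge, which drops them.
check = properties.merge(zones, left_on="NTA", right_on="zone", how="left", indicator=True)
unmatched = check.loc[check["_merge"] == "left_only", "NTA"].tolist()
```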

###### Latitude and Longitude

- Missing latitude and longitude values cannot be derived from the other columns.
- Count the rows in which Latitude, Longitude, and New Georeferenced Column are all empty.

In [38]:
df.shape

(1352026, 32)

In [39]:
filtered_data = df[df['Longitude'].isnull() & df['Latitude'].isnull()& df['NewGeoreferencedColumn'].isnull()]

In [40]:
filtered_data.shape

(10967, 32)

In [41]:
df[df['BBLE']=='1016440042']

Unnamed: 0,BBLE,BLOCK,LOT,BLDGCL,TAXCLASS,LTFRONT,LTDEPTH,EXT,STORIES,FULLVAL,...,EXTOT2,EXCD2,YEAR,Latitude,Longitude,CommunityBoard,CouncilDistrict,CensusTract,BIN,NewGeoreferencedColumn
4572147,1016440042,1644,42,V1,4,25,72,,,323000,...,0.0,,2014/15,40.798896,-73.940188,111.0,8.0,182.0,1000000.0,POINT (-73.940188 40.798896)
6736323,1016440042,1644,42,V1,4,25,72,,,301000,...,0.0,,2012/13,,,,,,,
3473558,1016440042,1644,42,V1,4,25,72,,,369000,...,0.0,,2015/16,40.798896,-73.940188,111.0,8.0,182.0,1000000.0,POINT (-73.940188 40.798896)
7835294,1016440042,1644,42,V1,4,25,72,,,296200,...,0.0,,2010/11,,,,,,,
5647665,1016440042,1644,42,V1,4,25,72,,,311217,...,0.0,,2013/14,,,,,,,
2394511,1016440042,1644,42,V1,4,25,72,,,389000,...,0.0,,2016/17,40.798896,-73.940188,111.0,8.0,182.0,1000000.0,POINT (-73.940188 40.798896)
171194,1016440042,1644,42,V1,4,25,72,,,452000,...,0.0,,2018/19,40.798896,-73.940188,111.0,8.0,182.0,1000000.0,POINT (-73.940188 40.798896)
1284215,1016440042,1644,42,V1,4,25,72,,,416000,...,0.0,,2017/18,40.798896,-73.940188,111.0,8.0,182.0,1000000.0,POINT (-73.940188 40.798896)
8896625,1016440042,1644,42,V1,4,25,72,,,296000,...,0.0,,2011/12,,,,,,,


- Some rows lack latitude and longitude, but other rows for the same building (same BBLE) have them. Group the data by BBLE and fill the missing coordinates from the rows of the same building.

In [42]:
grouped = df.groupby('BBLE')
for name, group in grouped:
    if group['Latitude'].nunique(dropna = False) != 1:
        valid_lat = group['Latitude'].dropna().drop_duplicates()
        valid_lon = group['Longitude'].dropna().drop_duplicates()
        df.loc[group.index, 'Latitude'] = group['Latitude'].fillna(valid_lat.values[0])
        df.loc[group.index, 'Longitude'] = group['Longitude'].fillna(valid_lon.values[0])
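
Looping over every BBLE group is slow on more than a million rows. A vectorized sketch of the same idea, shown on toy data: `transform('first')` broadcasts the first non-null coordinate of each group to all of its rows, and `fillna` then touches only the missing cells. This is an alternative formulation, not the code the notebook ran.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "BBLE": ["A", "A", "A", "B", "B"],
    "Latitude":  [np.nan, 40.798896, np.nan, np.nan, np.nan],
    "Longitude": [np.nan, -73.940188, np.nan, np.nan, np.nan],
})

# transform('first') returns, for every row, the first non-null value of its
# BBLE group; groups without any coordinate stay NaN.
for col in ["Latitude", "Longitude"]:
    toy[col] = toy[col].fillna(toy.groupby("BBLE")[col].transform("first"))
```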

- Using latitude and longitude to determine what zone the real estate belongs to

In [43]:
from shapely import wkt

# Load the Manhattan taxi zones saved earlier
region_data = pd.read_csv('programing data/taxi_zones.csv')

# Parse each zone's WKT string into a shapely geometry.
# wkt.loads keeps the separate parts of a MULTIPOLYGON intact; flattening all
# coordinates into a single Polygon can produce an invalid shape.
region_polygons = [wkt.loads(geom) for geom in region_data['the_geom']]
region_ids = region_data['LocationID'].tolist()  # zone IDs, in the same order

df['LocationID'] = None  # Create a new column and initialize it to None

for index, row in df.iterrows():
    # Skip rows without coordinates (missing values are NaN, not empty strings)
    if pd.isna(row['Latitude']) or pd.isna(row['Longitude']):
        continue
    house_point = Point(row['Longitude'], row['Latitude'])

    # Assign the first zone that contains the property
    for i, polygon in enumerate(region_polygons):
        if house_point.within(polygon):
            df.at[index, 'LocationID'] = region_ids[i]
            break
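
The zone lookup can be tried on a toy geometry. A minimal sketch, assuming shapely's WKT parser (`shapely.wkt.loads`) is used to read the `MULTIPOLYGON` strings; the helper `find_zone` and the two-square zone are invented for illustration.

```python
from shapely import wkt
from shapely.geometry import Point

# Toy zone made of two separate unit squares (not a real taxi zone)
zone = wkt.loads(
    "MULTIPOLYGON (((0 0, 1 0, 1 1, 0 1, 0 0)), ((2 0, 3 0, 3 1, 2 1, 2 0)))"
)

def find_zone(lon, lat, polygons, ids):
    """Return the ID of the first geometry containing the point, or None."""
    point = Point(lon, lat)  # shapely uses (x, y) = (longitude, latitude)
    for geom, zone_id in zip(polygons, ids):
        if point.within(geom):
            return zone_id
    return None

inside = find_zone(2.5, 0.5, [zone], [1])    # lies in the second square
between = find_zone(1.5, 0.5, [zone], [1])   # lies in the gap between the squares
```

Parsing the WKT keeps the two squares as separate parts, so a point in the gap between them is correctly reported as outside the zone.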




- See if the building LocationID can be inferred from the same address

In [44]:
df.loc[df['STADDR'] == '1 AVENUE', "LocationID"].unique()

array([None, 75, 233, 224], dtype=object)

- The same address can belong to two or three different neighborhoods, so the neighborhood cannot be predicted from the address.
- Remove rows without latitude and longitude values.

In [45]:
df = df[~((df["Latitude"].isna() & df["Longitude"].isna() ))]

###### YEAR

- Data from fiscal year 2014/15 onward is kept; earlier years are removed.

In [46]:
df['YEAR'].unique()

array(['2010/11', '2011/12', '2017/18', '2016/17', '2018/19', '2013/14',
       '2014/15', '2015/16', '2012/13'], dtype=object)

In [47]:
years_to_delete = ['2013/14', '2012/13', '2010/11', '2011/12']
df = df[~df['YEAR'].isin(years_to_delete)]

- Check %Missing column and %null column

In [48]:
# Percentage of 'Missing' placeholders, nulls, and zeros per column
n_rows = df.shape[0]
categorical_missing = {'Feature':[], 'Missing%':[], 'Null%':[], 'Total%':[], '0%':[]}
for column in df.columns:
    missing_pct = 100 * (df[column] == 'Missing').sum() / n_rows
    null_pct = 100 * df[column].isnull().sum() / n_rows
    categorical_missing['Feature'].append(column)
    categorical_missing['Missing%'].append(missing_pct)
    categorical_missing['Null%'].append(null_pct)
    categorical_missing['Total%'].append(missing_pct + null_pct)
    categorical_missing['0%'].append(100 * (df[column] == 0).sum() / n_rows)
pd.DataFrame(categorical_missing)

Unnamed: 0,Feature,Missing%,Null%,Total%,0%
0,BBLE,0.0,0.0,0.0,0.0
1,BLOCK,0.0,0.0,0.0,0.0
2,LOT,0.0,0.0,0.0,0.0
3,BLDGCL,0.0,0.0,0.0,0.0
4,TAXCLASS,0.0,0.0,0.0,0.0
5,LTFRONT,0.0,0.0,0.0,57.678799
6,LTDEPTH,0.0,0.0,0.0,58.061184
7,EXT,0.0,93.43059,93.43059,0.0
8,STORIES,0.0,3.068378,3.068378,0.0
9,FULLVAL,0.0,0.0,0.0,0.0


In [49]:
df.to_csv('programing data/Property_Valuation_and_Assessment_Data.csv', index=False)

### Selecting features:

- Look for rows and columns. Consider whether it makes sense to keep them or drop them.

Feature Selection Summary:

- Real Estate:

| Feature | Data Plan |
|---------|-----------|
| BBLE | Inherit BBLE from the original dataset |
| BLDGCL | Inherit BLDGCL from the original dataset |
| TAXCLASS | Inherit TAXCLASS from the original dataset |
| EXT | Inherit EXT from the original dataset |
| STORIES | Inherit STORIES from the original dataset |
| FULLVAL | Inherit FULLVAL from the original dataset |
| AVLAND | Average of AVLAND and AVLAND2 from the original dataset |
| AVTOT | Average of AVTOT and AVTOT2 from the original dataset |
| EXLAND | Average of EXLAND and EXLAND2 from the original dataset |
| EXTOT | Average of EXTOT and EXTOT2 from the original dataset |
| YEAR | Inherit YEAR from the original dataset, slice the first 4 characters |
| LocationID | Inherit LocationID from the original dataset |


In [50]:
# .copy() makes new_df an independent DataFrame, which avoids SettingWithCopyWarning
new_df = df[['BBLE', 'BLDGCL', 'TAXCLASS', 'EXT', 'STORIES', 'FULLVAL', 'LocationID']].copy()
new_df['AVLAND'] = (df['AVLAND'] + df['AVLAND2']) / 2
new_df['AVTOT'] = (df['AVTOT'] + df['AVTOT2']) / 2
new_df['EXLAND'] = (df['EXLAND'] + df['EXLAND2']) / 2
new_df['EXTOT'] = (df['EXTOT'] + df['EXTOT2']) / 2
new_df['YEAR'] = df['YEAR'].str[:4]
new_df.to_csv('programing data/Real_Estate.csv', index=False)
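
The feature plan (average the two assessments, keep the first four characters of YEAR) can be sketched on a toy frame. The values are invented; taking an explicit `.copy()` of the column selection is one common way to avoid pandas' SettingWithCopyWarning when new columns are added afterwards.

```python
import pandas as pd

toy = pd.DataFrame({
    "BBLE": ["1016440042", "1016440042"],
    "AVLAND": [100, 200],
    "AVLAND2": [120.0, 200.0],
    "YEAR": ["2014/15", "2015/16"],
})

# .copy() makes new_toy an independent frame before columns are added to it
new_toy = toy[["BBLE"]].copy()
new_toy["AVLAND"] = (toy["AVLAND"] + toy["AVLAND2"]) / 2  # average of both assessments
new_toy["YEAR"] = toy["YEAR"].str[:4]                     # "2014/15" -> "2014"
```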

- NTA real estate

| Feature | Data Plan |
|---------|-----------|
| BLDGCL | Class of building |
| AMOUNT | The total number of buildings |
| FULLVAL | The sum of the 'FULLVAL' of all the buildings in the area |
| AVLAND | The sum of the 'AVLAND' of all the buildings in the area |
| AVTOT | The sum of the 'AVTOT' of all the buildings in the area |
| EXLAND | The sum of the 'EXLAND' of all the buildings in the area |
| EXTOT | The sum of the 'EXTOT' of all the buildings in the area |
| YEAR | Inherit YEAR from the original dataset |
| LocationID | ID of the taxi zone in which the properties are located |

In [51]:
df = new_df
new_df = None

In [52]:
new_df = df.groupby(['YEAR', 'BLDGCL', 'LocationID']).agg({
    'BBLE': 'count',  # Number of properties in the group
    'AVLAND': 'sum',  
    'FULLVAL': 'sum',  
    'AVTOT': 'sum',
    'EXLAND':'sum',
    'EXTOT':'sum'  
}).reset_index()
new_df.columns = ['YEAR', 'BLDGCL', 'LocationID', 'AMOUNT', 'AVLAND', 'FULLVAL', 'AVTOT','EXLAND','EXTOT']
new_df.to_csv('programing data/NTA_Real_Estate.csv', index=False)
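
Renaming the columns by position after `agg` depends on the order of the dictionary. A sketch of the same aggregation with pandas named aggregation, which ties each output name to its source column; the toy values are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "YEAR": ["2015", "2015", "2015"],
    "BLDGCL": ["R4", "R4", "V1"],
    "LocationID": [74, 74, 74],
    "BBLE": ["a", "b", "c"],
    "FULLVAL": [100, 300, 50],
})

# Each keyword is the output column; the tuple is (source column, function)
summary = (
    toy.groupby(["YEAR", "BLDGCL", "LocationID"])
       .agg(AMOUNT=("BBLE", "count"), FULLVAL=("FULLVAL", "sum"))
       .reset_index()
)
```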

### 2.3 Taxi Trip Data:

#### Explain the meaning of each feature
- VendorID: The identifier of the taxi provider or operator.
- tpep_pickup_datetime: The date and time of the pick-up.
- tpep_dropoff_datetime: The date and time of the drop-off.
- passenger_count: Number of passengers.
- trip_distance: Travel distance.
- pickup_longitude: The longitude coordinate of the pick-up.
- pickup_latitude: The latitude coordinate of the pick-up.
- RatecodeID: A code that identifies the taxi rate.
- store_and_fwd_flag: A flag (Y/N) indicating whether the trip record was stored locally in the vehicle and forwarded later, rather than sent immediately.
- dropoff_longitude: The longitude coordinate of the drop-off.
- dropoff_latitude: The latitude coordinate of the drop-off.
- payment_type: A numeric code for the payment method used by the passenger.
- fare_amount: The amount of the trip cost, excluding surcharges, tips, tolls, etc.
- extra: Miscellaneous extras and surcharges, such as rush hour and overnight charges.
- mta_tax: Metropolitan Transportation Authority (MTA) tax.
- tip_amount: The amount of the tip given by the passenger.
- tolls_amount: The amount of the tolls.
- improvement_surcharge: An improvement surcharge assessed on trips at the flag drop.
- total_amount: The total cost, including the sum of the trip cost, additional fees, tips, tolls and other surcharges.
- PULocationID: The location ID of the pick-up.
- DOLocationID: The location ID of the drop-off.

#### New feature
- Year: Year of data generation.
- LocationID: The location ID.
- Pick_up: The count of the pick-up.
- Drop_off: The count of the drop-off.
- Pick_up_passenger: The total number of passengers picked up.
- Drop_off_passenger: The total number of passengers dropped off.
- Pick_up_fare: The total fare of the pick-up trips.
- Drop_off_fare: The total fare of the drop-off trips.
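
One possible way to derive these features from the trip table is sketched below. It is an assumption-laden sketch, not the notebook's own code: the helper `zone_summary` is hypothetical, the fare totals are assumed to use `total_amount`, and the toy trips are invented.

```python
import pandas as pd

def zone_summary(trips, year):
    """Aggregate trips per zone (sketch; fare totals assume `total_amount`)."""
    pickups = trips.groupby("PULocationID").agg(
        Pick_up=("passenger_count", "size"),          # number of pick-up trips
        Pick_up_passenger=("passenger_count", "sum"),
        Pick_up_fare=("total_amount", "sum"),
    ).rename_axis("LocationID")
    dropoffs = trips.groupby("DOLocationID").agg(
        Drop_off=("passenger_count", "size"),         # number of drop-off trips
        Drop_off_passenger=("passenger_count", "sum"),
        Drop_off_fare=("total_amount", "sum"),
    ).rename_axis("LocationID")
    # Outer join keeps zones that appear only as a pick-up or only as a drop-off
    out = pickups.join(dropoffs, how="outer").fillna(0).reset_index()
    out.insert(0, "Year", year)
    return out

toy_trips = pd.DataFrame({
    "PULocationID": [230, 230, 162],
    "DOLocationID": [229, 162, 230],
    "passenger_count": [1, 2, 1],
    "total_amount": [9.3, 10.0, 63.35],
})
features = zone_summary(toy_trips, 2016)
```

Zone 229 appears only as a drop-off in the toy data, so its pick-up columns are filled with 0.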

### Cleaning data
The 2016 data is used as an example.

In [53]:
df = pd.read_csv("dataSource/2016_Yellow_Taxi_Trip_Data.csv")


In [54]:
df.head(5)

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RatecodeID,store_and_fwd_flag,dropoff_longitude,...,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,PULocationID,DOLocationID
0,1,08/12/2016 11:55:28 AM,08/12/2016 12:07:57 PM,1.0,0.8,,,1.0,N,,...,2.0,8.5,0.0,0.5,0.0,0.0,0.3,9.3,230.0,229.0
1,1,07/20/2016 02:36:59 PM,07/20/2016 03:26:37 PM,1.0,15.3,,,2.0,N,,...,1.0,52.0,0.0,0.5,10.55,0.0,0.3,63.35,162.0,132.0
2,2,07/05/2016 09:21:27 PM,07/05/2016 09:39:07 PM,2.0,3.56,,,1.0,N,,...,1.0,14.5,0.5,0.5,3.95,0.0,0.3,19.75,231.0,181.0
3,1,07/09/2016 11:39:16 PM,07/09/2016 11:44:16 PM,1.0,0.7,,,1.0,N,,...,2.0,5.0,0.5,0.5,0.0,0.0,0.3,6.3,186.0,234.0
4,2,07/11/2016 10:42:40 PM,07/11/2016 10:52:42 PM,5.0,1.27,,,1.0,N,,...,2.0,8.0,0.5,0.5,0.0,0.0,0.3,9.3,161.0,100.0


- Check how many rows and columns the dataset has.

In [55]:
df.shape

(12867402, 21)

- Check the column data types

In [56]:
df.dtypes

VendorID                  object
tpep_pickup_datetime      object
tpep_dropoff_datetime     object
passenger_count          float64
trip_distance            float64
pickup_longitude         float64
pickup_latitude          float64
RatecodeID               float64
store_and_fwd_flag        object
dropoff_longitude        float64
dropoff_latitude         float64
payment_type             float64
fare_amount              float64
extra                    float64
mta_tax                  float64
tip_amount               float64
tolls_amount             float64
improvement_surcharge    float64
total_amount             float64
PULocationID             float64
DOLocationID             float64
dtype: object
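
The two datetime columns are loaded as `object` (plain strings). A minimal sketch of how they could be parsed into real timestamps, assuming the `MM/DD/YYYY hh:mm:ss AM/PM` format seen in the preview above; the `sample` frame and the `pickup_ts` and `pickup_year` column names are made up for illustration:

```python
import pandas as pd

# Tiny made-up sample in the same string format as the trip data above.
sample = pd.DataFrame({
    "tpep_pickup_datetime": [
        "08/12/2016 11:55:28 AM",
        "07/20/2016 02:36:59 PM",
        "not a date",  # stands in for a malformed row
    ],
})

# errors="coerce" turns unparseable strings into NaT instead of raising.
sample["pickup_ts"] = pd.to_datetime(
    sample["tpep_pickup_datetime"],
    format="%m/%d/%Y %I:%M:%S %p",
    errors="coerce",
)

# Once parsed, the year can be extracted directly.
sample["pickup_year"] = sample["pickup_ts"].dt.year
```

Rows that fail to parse end up as `NaT`, which makes malformed records easy to spot with `isnull()`.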

- Check if there are duplicate rows or columns

In [57]:
#remove whitespace in or around feature names
df.columns = df.columns.str.replace(' ', '')

#check to ensure whitespaces have been removed
df.columns

Index(['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
       'passenger_count', 'trip_distance', 'pickup_longitude',
       'pickup_latitude', 'RatecodeID', 'store_and_fwd_flag',
       'dropoff_longitude', 'dropoff_latitude', 'payment_type', 'fare_amount',
       'extra', 'mta_tax', 'tip_amount', 'tolls_amount',
       'improvement_surcharge', 'total_amount', 'PULocationID',
       'DOLocationID'],
      dtype='object')

In [58]:
#check for duplicate rows

#Print the number of duplicates, without the original rows that were duplicated
print('Number of duplicate (excluding first) rows in the table is: ', df.duplicated().sum())

# Use "keep=False" to mark all duplicates as true, including the original rows that were duplicated.
print('Number of duplicate rows (including first) in the table is:', df[df.duplicated(keep=False)].shape[0])

Number of duplicate (excluding first) rows in the table is:  1
Number of duplicate rows (including first) in the table is: 2


There is only one duplicate row, so remove it.

In [59]:
df.drop_duplicates(inplace = True)

- Check if there is a constant column

In [60]:
# Count the unique values in every column to see if there is a constant column
print('{0:35}  {1}'.format("Feature", "Unique Values"))
print('{0:35}  {1}'.format("-------", "--------------- \n"))

for c in df.columns:
    # dropna=False counts NaN as a value, like len(df[c].unique())
    print('{0:35}  {1}'.format(c, df[c].nunique(dropna=False)))

Feature                              Unique Values
-------                              --------------- 

VendorID                             10
tpep_pickup_datetime                 8323781
tpep_dropoff_datetime                8325244
passenger_count                      11
trip_distance                        4758
pickup_longitude                     1
pickup_latitude                      1
RatecodeID                           8
store_and_fwd_flag                   3
dropoff_longitude                    1
dropoff_latitude                     1
payment_type                         6
fare_amount                          2102
extra                                45
mta_tax                              12
tip_amount                           3930
tolls_amount                         1090
improvement_surcharge                5
total_amount                         12598
PULocationID                         262
DOLocationID                         263


The result above shows that pickup_longitude, pickup_latitude, dropoff_longitude and dropoff_latitude each contain a single unique value (they are empty in the preview above), so these four constant columns are dropped.

In [61]:
columns_to_drop = ["pickup_longitude", "pickup_latitude", "dropoff_longitude", "dropoff_latitude"]
df = df.drop(columns_to_drop, axis=1)
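
Instead of reading the constant columns off the printed table, they could also be detected programmatically. This is only a sketch on a small made-up frame; the names `sample`, `unique_counts`, `constant_columns` and `cleaned` are illustrative and not part of the notebook:

```python
import numpy as np
import pandas as pd

# Made-up sample: one informative column and two constant ones.
sample = pd.DataFrame({
    "trip_distance": [0.8, 15.3, 3.56],
    "pickup_longitude": [np.nan, np.nan, np.nan],  # entirely empty
    "store_and_fwd_flag": ["N", "N", "N"],         # one repeated value
})

# dropna=False counts NaN as a value, so an all-NaN column has 1 unique value.
unique_counts = sample.nunique(dropna=False)

# A column with at most one unique value carries no information.
constant_columns = unique_counts[unique_counts <= 1].index.tolist()
cleaned = sample.drop(columns=constant_columns)
```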

In [62]:
df.head(5)

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,PULocationID,DOLocationID
0,1,08/12/2016 11:55:28 AM,08/12/2016 12:07:57 PM,1.0,0.8,1.0,N,2.0,8.5,0.0,0.5,0.0,0.0,0.3,9.3,230.0,229.0
1,1,07/20/2016 02:36:59 PM,07/20/2016 03:26:37 PM,1.0,15.3,2.0,N,1.0,52.0,0.0,0.5,10.55,0.0,0.3,63.35,162.0,132.0
2,2,07/05/2016 09:21:27 PM,07/05/2016 09:39:07 PM,2.0,3.56,1.0,N,1.0,14.5,0.5,0.5,3.95,0.0,0.3,19.75,231.0,181.0
3,1,07/09/2016 11:39:16 PM,07/09/2016 11:44:16 PM,1.0,0.7,1.0,N,2.0,5.0,0.5,0.5,0.0,0.0,0.3,6.3,186.0,234.0
4,2,07/11/2016 10:42:40 PM,07/11/2016 10:52:42 PM,5.0,1.27,1.0,N,2.0,8.0,0.5,0.5,0.0,0.0,0.3,9.3,161.0,100.0


- Check the percentage of 'Missing' and null values in each column

In [63]:
# Percentage of 'Missing' placeholders and of null values in each column
categorical_missing = {'Feature':[], 'Missing%':[], 'Null%':[], 'Total%':[]}
for column in df.columns:
    # Use the vectorized .sum(); Python's built-in sum() is very slow on millions of rows
    missing_pct = 100 * (df[column] == 'Missing').sum() / df.shape[0]
    null_pct = 100 * df[column].isnull().sum() / df.shape[0]
    categorical_missing['Feature'].append(column)
    categorical_missing['Missing%'].append(missing_pct)
    categorical_missing['Null%'].append(null_pct)
    categorical_missing['Total%'].append(missing_pct + null_pct)
pd.DataFrame(categorical_missing)

Unnamed: 0,Feature,Missing%,Null%,Total%
0,VendorID,0.0,0.0,0.0
1,tpep_pickup_datetime,0.0,3.1e-05,3.1e-05
2,tpep_dropoff_datetime,0.0,3.9e-05,3.9e-05
3,passenger_count,0.0,3.9e-05,3.9e-05
4,trip_distance,0.0,3.9e-05,3.9e-05
5,RatecodeID,0.0,3.9e-05,3.9e-05
6,store_and_fwd_flag,0.0,3.9e-05,3.9e-05
7,payment_type,0.0,3.9e-05,3.9e-05
8,fare_amount,0.0,3.9e-05,3.9e-05
9,extra,0.0,3.9e-05,3.9e-05


In [64]:
df[df['PULocationID'].isnull()]

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,PULocationID,DOLocationID
12867397,1,09/19/2016 12:28:38 {,,,,,,,,,,,,,,,
12867398,"""error"" : true",,,,,,,,,,,,,,,,
12867399,"""message"" : ""Internal error""",,,,,,,,,,,,,,,,
12867400,"""status"" : 500",,,,,,,,,,,,,,,,
12867401,},,,,,,,,,,,,,,,,


The rows above are the tail of a failed download: a JSON error message ("Internal error", status 500) was written into the CSV. They contain almost no data, so they should be removed.

In [65]:
null_rows = df[df['PULocationID'].isnull()]
df = df.drop(null_rows.index)

- Check the percentage of 'Missing' and null values in each column again

In [66]:
# Percentage of 'Missing' placeholders and of null values in each column
categorical_missing = {'Feature':[], 'Missing%':[], 'Null%':[], 'Total%':[]}
for column in df.columns:
    # Use the vectorized .sum(); Python's built-in sum() is very slow on millions of rows
    missing_pct = 100 * (df[column] == 'Missing').sum() / df.shape[0]
    null_pct = 100 * df[column].isnull().sum() / df.shape[0]
    categorical_missing['Feature'].append(column)
    categorical_missing['Missing%'].append(missing_pct)
    categorical_missing['Null%'].append(null_pct)
    categorical_missing['Total%'].append(missing_pct + null_pct)
pd.DataFrame(categorical_missing)

Unnamed: 0,Feature,Missing%,Null%,Total%
0,VendorID,0.0,0.0,0.0
1,tpep_pickup_datetime,0.0,0.0,0.0
2,tpep_dropoff_datetime,0.0,0.0,0.0
3,passenger_count,0.0,0.0,0.0
4,trip_distance,0.0,0.0,0.0
5,RatecodeID,0.0,0.0,0.0
6,store_and_fwd_flag,0.0,0.0,0.0
7,payment_type,0.0,0.0,0.0
8,fare_amount,0.0,0.0,0.0
9,extra,0.0,0.0,0.0
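
The same missing-value table can be computed without an explicit loop. The sketch below uses a small made-up frame (`sample`) so that it runs on its own; it is an alternative formulation, not the code that produced the output above:

```python
import numpy as np
import pandas as pd

# Made-up sample with one 'Missing' placeholder and two null values.
sample = pd.DataFrame({
    "passenger_count": [1.0, np.nan, 2.0, 5.0],
    "store_and_fwd_flag": ["N", "Missing", "N", None],
})

# The mean of a boolean mask is the fraction of True values per column.
missing_pct = sample.eq("Missing").mean() * 100
null_pct = sample.isnull().mean() * 100

summary = pd.DataFrame({
    "Feature": sample.columns,
    "Missing%": missing_pct.values,
    "Null%": null_pct.values,
    "Total%": (missing_pct + null_pct).values,
})
```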


### Selecting features:

- Review each column and decide whether it makes sense to keep it or drop it.

Feature Selection Summary:
- VendorID: The identifier of the taxi provider or operator. Not relevant to this analysis, drop.
- tpep_pickup_datetime: Pick-up time, keep (used to identify the year of the trip).
- tpep_dropoff_datetime: Drop-off time, keep (used to identify the year of the trip).
- passenger_count: Number of passengers, keep.
- trip_distance: Travel distance, keep.
- RatecodeID: Final rate code, drop.
- store_and_fwd_flag: A Boolean flag indicating whether the trip record was stored locally in the vehicle and forwarded to the dispatch center later. Not relevant to this analysis, drop.
- payment_type: How the trip was paid for. Not relevant to this analysis, drop.
- fare_amount: Part of the total amount, drop.
- extra: Part of the total amount, drop.
- mta_tax: Part of the total amount, drop.
- tip_amount: Part of the total amount, drop.
- tolls_amount: Part of the total amount, drop.
- improvement_surcharge: Part of the total amount, drop.
- total_amount: The total amount to be paid, keep.
- PULocationID: The location ID of the pick-up, keep.
- DOLocationID: The location ID of the drop-off, keep.

- Drop unselected columns

In [67]:
# Columns marked "drop" in the feature selection summary
columns_to_drop = ["VendorID", "RatecodeID", "store_and_fwd_flag", "payment_type", "fare_amount", "extra", "mta_tax", "tip_amount",
                  "tolls_amount", "improvement_surcharge"]
df = df.drop(columns_to_drop, axis=1)

In [68]:
df.head(5)

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,total_amount,PULocationID,DOLocationID
0,08/12/2016 11:55:28 AM,08/12/2016 12:07:57 PM,1.0,0.8,9.3,230.0,229.0
1,07/20/2016 02:36:59 PM,07/20/2016 03:26:37 PM,1.0,15.3,63.35,162.0,132.0
2,07/05/2016 09:21:27 PM,07/05/2016 09:39:07 PM,2.0,3.56,19.75,231.0,181.0
3,07/09/2016 11:39:16 PM,07/09/2016 11:44:16 PM,1.0,0.7,6.3,186.0,234.0
4,07/11/2016 10:42:40 PM,07/11/2016 10:52:42 PM,5.0,1.27,9.3,161.0,100.0


- Clean up the retained data.

    Since only trips within Manhattan are needed, exclude trips whose pick-up or drop-off location is outside Manhattan.
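
The notebook does not show the code for this filter. A minimal sketch of how it could look, assuming the taxi zone table (with its `LocationID` and `borough` columns) is used to identify Manhattan zones and that a trip is kept only when both ends are in Manhattan. The sample values and the names `zones`, `trips`, `manhattan_ids` and `manhattan_trips` are made up for illustration:

```python
import pandas as pd

# Made-up zone table with the same columns as taxi_zones.csv.
zones = pd.DataFrame({
    "LocationID": [1, 2, 4, 12],
    "borough": ["EWR", "Queens", "Manhattan", "Manhattan"],
})

# Made-up trips; location IDs are floats, as in the raw trip data.
trips = pd.DataFrame({
    "PULocationID": [4.0, 4.0, 1.0, 12.0],
    "DOLocationID": [12.0, 2.0, 4.0, 4.0],
    "total_amount": [9.3, 63.35, 19.75, 6.3],
})

# IDs of all zones located in Manhattan.
manhattan_ids = zones.loc[zones["borough"] == "Manhattan", "LocationID"]

# Assumption: keep a trip only if pick-up AND drop-off are in Manhattan.
in_manhattan = (
    trips["PULocationID"].isin(manhattan_ids)
    & trips["DOLocationID"].isin(manhattan_ids)
)
manhattan_trips = trips[in_manhattan]
```

If trips with only one end in Manhattan should be kept instead, the `&` would become `|`.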

- Save updated/cleaned data frame to a new csv file.

In [None]:
# index=False keeps the row index out of the file, consistent with the other exports
df.to_csv("programing data/2016_Yellow_Taxi_Trip_Data.csv", index=False)

- Saving the processed data into the database

In [None]:
from datetime import datetime

# Create MySQL database connection
conn = mysql.connector.connect(host='localhost', user='root', password='147258Xiao', database='Investment')
cursor = conn.cursor()

# The table name must match the one used in the INSERT statement below
create_table_sql = """
CREATE TABLE Taxi_2016 (
    tpep_pickup_datetime DATETIME,
    tpep_dropoff_datetime DATETIME,
    passenger_count INT,
    trip_distance FLOAT,
    PULocationID INT,
    DOLocationID INT,
    fare_amount FLOAT,
    total_amount FLOAT
)
"""

cursor.execute(create_table_sql)

# Insert statement for one row of the database table
insert_query = "INSERT INTO Investment.Taxi_2016 (tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, PULocationID, DOLocationID, fare_amount, total_amount) VALUES (%s, %s, %s, %s, %s, %s, %s, %s)"

# DictReader looks columns up by name, so a leading index column or a different column order does not matter.
# Note: the CSV must contain a fare_amount column, so fare_amount has to be kept when the unselected columns are dropped.
with open('programing data/2016_Yellow_Taxi_Trip_Data.csv', 'r', newline='') as file:
    csv_reader = csv.DictReader(file)

    for row in csv_reader:

        # Extract the data from the CSV file
        tpep_pickup_datetime = datetime.strptime(row['tpep_pickup_datetime'], '%m/%d/%Y %I:%M:%S %p')
        tpep_dropoff_datetime = datetime.strptime(row['tpep_dropoff_datetime'], '%m/%d/%Y %I:%M:%S %p')
        # Very few passenger counts are missing and they cannot be estimated reliably, so a missing count is filled with the lowest value (1).
        passenger_count = int(float(row['passenger_count'])) if row['passenger_count'] else 1
        trip_distance = float(row['trip_distance'])
        PULocationID = int(float(row['PULocationID']))
        DOLocationID = int(float(row['DOLocationID']))
        fare_amount = float(row['fare_amount'])
        total_amount = float(row['total_amount'])

        # Insert data into a database table
        values = (tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, PULocationID, DOLocationID, fare_amount, total_amount)
        cursor.execute(insert_query, values)

# Commit once after all rows are inserted instead of once per row
conn.commit()
        

- Summarize information: For each LocationID, compute the number of pick-ups, the number of drop-offs, the number of passengers picked up, the number of passengers dropped off, the total fare of the pick-up trips, and the total fare of the drop-off trips.

In [None]:

# Query pickup location statistics (Taxi_2016 is the table the 2016 trips were inserted into)
query_pickup = """
    SELECT 
        PULocationID,
        COUNT(*) AS Pick_Up,
        SUM(passenger_count) AS Pick_up_passenger,
        SUM(fare_amount) AS Pick_up_fare
    FROM 
        Taxi_2016
    GROUP BY 
        PULocationID
"""
# Query drop-off location statistics
query_dropoff = """
    SELECT 
        DOLocationID,
        COUNT(*) AS Drop_Off,
        SUM(passenger_count) AS Drop_off_passenger,
        SUM(fare_amount) AS Drop_off_fare
    FROM 
        Taxi_2016
    GROUP BY 
        DOLocationID
"""

# Execute the query and get the results
cursor.execute(query_pickup)
df_pickup = cursor.fetchall()

cursor.execute(query_dropoff)
df_dropoff = cursor.fetchall()

# Closing the database Connection
conn.close()

# Convert the result to a DataFrame
df_pickup = pd.DataFrame(df_pickup, columns=['PULocationID', 'Pick_Up', 'Pick_up_passenger', 'Pick_up_fare'])
df_dropoff = pd.DataFrame(df_dropoff, columns=['DOLocationID', 'Drop_Off', 'Drop_off_passenger', 'Drop_off_fare'])

# Perform the association of pick-up and drop-off locations
df_merged = pd.merge(df_pickup, df_dropoff, left_on='PULocationID', right_on='DOLocationID', how='inner')

# Select the desired columns
df_result = df_merged[['PULocationID', 'Pick_Up', 'Drop_Off', 'Pick_up_passenger', 'Drop_off_passenger', 'Pick_up_fare', 'Drop_off_fare']]

In [None]:
df_result.head(5)

In [None]:
df_result.to_csv('programing data/2016_Summary.csv', index=False)
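
For reference, the same per-zone summary can be computed directly in pandas without a database round trip. This is a sketch on made-up trips, not the method used in the notebook; `trips`, `pickup`, `dropoff` and `summary` are illustrative names:

```python
import pandas as pd

# Made-up trips with the columns used by the two SQL queries.
trips = pd.DataFrame({
    "PULocationID": [4, 4, 12, 12],
    "DOLocationID": [12, 12, 4, 12],
    "passenger_count": [1, 2, 1, 5],
    "fare_amount": [8.5, 52.0, 14.5, 5.0],
})

# Equivalent of the pick-up query: GROUP BY PULocationID.
pickup = (
    trips.groupby("PULocationID")
    .agg(
        Pick_Up=("passenger_count", "size"),  # COUNT(*)
        Pick_up_passenger=("passenger_count", "sum"),
        Pick_up_fare=("fare_amount", "sum"),
    )
    .rename_axis("LocationID")
)

# Equivalent of the drop-off query: GROUP BY DOLocationID.
dropoff = (
    trips.groupby("DOLocationID")
    .agg(
        Drop_Off=("passenger_count", "size"),
        Drop_off_passenger=("passenger_count", "sum"),
        Drop_off_fare=("fare_amount", "sum"),
    )
    .rename_axis("LocationID")
)

# Inner join keeps only zones that have both pick-ups and drop-offs.
summary = pickup.join(dropoff, how="inner").reset_index()
```

As with the inner merge above, zones that appear only as a pick-up or only as a drop-off location are left out of the result.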

- Connect to the database, create the summary table, and insert the summary data

In [None]:
# Create MySQL database connection
conn = mysql.connector.connect(host='localhost', user='root', password='147258Xiao', database='Investment')
cursor = conn.cursor()

create_table_sql = """
CREATE TABLE Taxi_Summary (
    Year INT,
    LocationID INT,
    Pick_Up INT,
    Drop_Off INT,
    Pick_up_passenger INT,
    Drop_off_passenger INT,
    Pick_up_fare FLOAT,
    Drop_off_fare FLOAT
)
"""

cursor.execute(create_table_sql)

In [None]:

# SQL statement to insert data; the year matches the 2016 summary file being loaded
insert_sql = "INSERT INTO Taxi_Summary (Year, LocationID, Pick_Up, Drop_Off, Pick_up_passenger, Drop_off_passenger, Pick_up_fare, Drop_off_fare) VALUES (2016, %s, %s, %s, %s, %s, %s, %s)"

with open('programing data/2016_Summary.csv', 'r', newline='') as file:
    reader = csv.reader(file)
    next(reader)  # Skip the header line
    for row in reader:
        locationID = int(row[0])
        pick_Up = row[1]
        drop_Off = row[2]
        pick_up_passenger = row[3]
        drop_off_passenger = row[4]
        pick_up_fare = row[5]
        drop_off_fare = row[6]

        cursor.execute(insert_sql, (locationID, pick_Up, drop_Off, pick_up_passenger, drop_off_passenger, pick_up_fare, drop_off_fare))

# Commit once after all rows are inserted instead of once per row
conn.commit()
conn.close()

In [None]:
conn = mysql.connector.connect(host='localhost', user='root', password='147258Xiao', database='Investment')
cursor = conn.cursor()
query = "SELECT * FROM Investment.Taxi_Summary"
df = pd.read_sql_query(query, conn)
df.to_csv('programing data/Taxi_Summary.csv', index=False, header=True)
conn.close()

### Data Standardization:

- Since the taxi data cover a different date range in each year and the totals therefore vary from year to year, Z-score is used to standardize the data.
- Z-score normalization rescales each column by subtracting its mean and dividing by its standard deviation, so that the result has mean 0 and standard deviation 1. It does not change the shape of the distribution, but it puts data of different scales and ranges on a comparable footing.

- Load the cleaned taxi data and the taxi zone data
- Remove data that is not in the Manhattan area

In [71]:
df = pd.read_csv('programing data/Taxi_Summary.csv')
zone = pd.read_csv('programing data/taxi_zones.csv')
location_ids = zone['LocationID'].tolist()
df = df[df['LocationID'].isin(location_ids)]

In [72]:
columns_to_normalize = ['Pick_Up', 'Drop_Off', 'Pick_up_passenger', 'Drop_off_passenger', 'Pick_up_fare', 'Drop_off_fare']

# Start the normalized DataFrame from the Year and LocationID columns of the raw data
df_normalized = df[['Year', 'LocationID']].copy()

# Calculate the mean and standard deviation of each column over all years
mean = df[columns_to_normalize].mean()
std = df[columns_to_normalize].std()

# Apply Z-score normalization column by column
for column in columns_to_normalize:
    df_normalized[column] = (df[column] - mean[column]) / std[column]

# Year and LocationID are already part of df_normalized, so no re-merge is needed
# (merging on them again would only duplicate rows if a key appeared twice)

# Output the normalized data
df_normalized.to_csv('programing data/Summary.csv', index=False)
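
The cell above standardizes each column with one mean and one standard deviation computed over all years. If the aim is to compensate for the different coverage of each year, one alternative would be to standardize within each year instead. The sketch below shows that variant on made-up numbers; it is not what the notebook does, and `summary`, `cols`, `z_scores` and `per_year` are illustrative names:

```python
import pandas as pd

# Made-up summary: 2017 has ten times the volume of 2016.
summary = pd.DataFrame({
    "Year": [2016, 2016, 2016, 2017, 2017, 2017],
    "LocationID": [4, 12, 13, 4, 12, 13],
    "Pick_Up": [10, 20, 30, 100, 200, 300],
})
cols = ["Pick_Up"]

# transform() returns per-row group statistics aligned with the original index.
grouped = summary.groupby("Year")[cols]
z_scores = (summary[cols] - grouped.transform("mean")) / grouped.transform("std")

# Put the identifying columns back next to the standardized values.
per_year = pd.concat([summary[["Year", "LocationID"]], z_scores], axis=1)
```

With per-year standardization, a zone's score expresses how it ranks among the zones of the same year, regardless of how many months of data that year contains.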


In [73]:
df_normalized = pd.read_csv('programing data/Summary.csv')
# Exclude the 2020 taxi data
df_normalized = df_normalized[df_normalized['Year'] != 2020]
df_normalized = df_normalized.rename(columns = {'Year':'YEAR'})
df_estate = pd.read_csv('programing data/Real_Estate.csv')
# Exclude the 2014 real estate data, then shift YEAR forward by one so that
# each real estate record is merged with the taxi data of the following year
df_estate = df_estate[df_estate['YEAR'] != 2014]
df_estate['YEAR'] = df_estate['YEAR'].astype(int) + 1



In [74]:
merged_df = pd.merge(df_normalized, df_estate, on=['YEAR', 'LocationID'])

In [76]:
X = merged_df[['Pick_Up', 'Drop_Off', 'Pick_up_passenger', 'Drop_off_passenger', 'Pick_up_fare', 'Drop_off_fare']]
y = merged_df['AVTOT']
X = sm.add_constant(X)
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                  AVTOT   R-squared:                       0.002
Model:                            OLS   Adj. R-squared:                  0.002
Method:                 Least Squares   F-statistic:                     203.3
Date:                Mon, 03 Jul 2023   Prob (F-statistic):          4.72e-260
Time:                        20:04:38   Log-Likelihood:            -1.0969e+07
No. Observations:              615778   AIC:                         2.194e+07
Df Residuals:                  615771   BIC:                         2.194e+07
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
const               1.176e+06   1.88

- Conclusion: With an R-squared of 0.002, the taxi features explain almost none of the variance in AVTOT. The relationship between the taxi data and the real estate data is too weak to build a useful prediction model.

- The taxi data is incomplete. The date ranges available for each year are:
    - 2016/7/1-2016/12/31
    - 2017/3/30-2017/12/31
    - 2018/2/14-2018/12/31
    - 2019/6/11-2019/12/31
    - 2020/1/1-2020/12/31
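
To put the R-squared of 0.002 reported in the regression summary above into context, the sketch below computes R-squared by hand for a strongly related and an unrelated pair of variables. The data is randomly generated for illustration only, and `r_squared` is a hypothetical helper, not part of the notebook:

```python
import numpy as np

def r_squared(x, y):
    """R-squared of a least-squares fit of y on x with an intercept."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    ss_res = (residuals ** 2).sum()        # variance left unexplained
    ss_tot = ((y - y.mean()) ** 2).sum()   # total variance of y
    return 1 - ss_res / ss_tot

# Randomly generated illustration data (fixed seed for repeatability).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
noise = rng.normal(size=200)

y_strong = 3.0 * x + 0.1 * noise  # y is almost fully determined by x
y_weak = noise                    # y is unrelated to x

r2_strong = r_squared(x, y_strong)
r2_weak = r_squared(x, y_weak)
```

A value close to 1 means the predictors account for nearly all of the variation in the target, while a value close to 0, as in the regression above, means they account for almost none of it.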
    