[Maven Analytics](https://www.mavenanalytics.io/blog/maven-taxi-challenge?utm_source=email&utm_campaign=taxichallengelaunch_email_udemy)

[Dataset](https://www.mavenanalytics.io/data-playground)

<a id="section0.1"></a>
### Table of contents:

[Data Dictionary](#sectionA)


----

#### About the dataset

* This dataset contains 6 tables in csv format, along with a geospatial map in TopoJSON and Shapefile formats

* The 4 Taxi Trips tables contain a total of 28 million Green Taxi trips in New York City from 2017 to 2020. Each record represents one trip, with fields containing details about the pick-up/drop-off times and locations, distances, fares, passengers, and more

* The 454 Calendar table contains a fiscal calendar (2017-2020) used by the Taxi & Limousine Commission, with fields containing the date and fiscal year, quarter, month, and week

* The Taxi Zones table contains information about 265 zone locations in New York City, including the location id, borough, and service zone

* The Taxi Zones Map files contain a map of New York City with divisions for the 265 locations that can be used to create custom map visuals in Power BI (TopoJSON) or Tableau (Shapefile)

---

#### How to play the Maven Taxi Challenge
For the Maven Taxi Challenge, you’ll be playing the role of a new Data Analyst for the New York City Taxi & Limousine Commission. It's your first week on the job, and you just received the following email from the Lead Dispatcher:


##### Welcome to the team!

We’ve been collecting trip data for ~4 years now, but without a proper analyst we haven’t been able to put it to good use. That's where you come in!

The raw data has some issues, so we'll need to make the following adjustments and assumptions to clean and prep the data:
<a id="section0"></a>

+ [Let’s stick to trips that were NOT sent via “store and forward”](#section1)
+ [I’m only interested in street-hailed trips paid by card or cash, with a standard rate](#section2)
+ [We can remove any trips with dates before 2017 or after 2020, along with any trips with pickups or drop-offs into unknown zones](#section3)
+ [Let’s assume any trips with no recorded passengers had 1 passenger](#section4)
+ [If a pickup date/time is AFTER the drop-off date/time, let’s swap them](#section5)
+ [We can remove trips lasting longer than a day, and any trips which show both a distance and fare amount of 0](#section6)
+ [If you notice any records where the fare, taxes, and surcharges are ALL negative, please make them positive](#section7)
+ [For any trips that have a fare amount but have a trip distance of 0, calculate the distance this way: (Fare amount - 2.5) / 2.5](#section8)
+ [For any trips that have a trip distance but have a fare amount of 0, calculate the fare amount this way: 2.5 + (trip distance x 2.5)](#section9)

Once the data is cleaned up, I’m hoping you can build me a dashboard to help with weekly planning and logistics. For any given fiscal week, I'd like to be able to use historical data to answer the following questions:

+ What's the average number of trips we can expect this week?
+ What's the average fare per trip we expect to collect?
+ What's the average distance traveled per trip?
+ How do we expect trip volume to change, relative to last week?
+ Which days of the week and times of the day will be busiest?
+ What will likely be the most popular pick-up and drop-off locations?

I realize this is a lot to ask for, but this type of analysis will have a huge impact on our business!

Thanks in advance,

Mario Maven (Lead Dispatcher, NYC Green Taxis)

-------

For this challenge, your task is to build a dashboard that meets Mario's requirements, and share a single page screenshot for any given fiscal week.

Here’s how to submit your entry:

+ Share a LinkedIn post mentioning **@Maven Analytics** and the hashtag **#maventaxichallenge**, with your single page dashboard based on the challenge objective above
+ Complete the official challenge submission form to make sure you are entered for a chance to win
+ Make sure to follow Maven Analytics on LinkedIn for updates on the challenge and invite your connections to play along!

In [1]:
import pandas as pd
import numpy as np

[Table of contents](#section0.1)
<a id="sectionA"></a>
### Data Dictionary

| Field | Description |
| --- | --- |
| VendorID | A code indicating the LPEP provider that provided the record (1= Creative Mobile Technologies, LLC; 2= Verifone Inc.) |
| lpep_pickup_datetime | The date and time when the meter was engaged |
| lpep_dropoff_datetime | The date and time when the meter was disengaged |
| store_and_fwd_flag | This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka “store and forward,” because the vehicle did not have a connection to the server (Y= store and forward trip; N= not a store and forward trip) |
| RatecodeID | The final rate code in effect at the end of the trip (1= Standard rate; 2= JFK; 3= Newark; 4= Nassau or Westchester; 5= Negotiated fare; 6= Group ride) |
| PULocationID | TLC Taxi Zone in which the taximeter was engaged |
| DOLocationID | TLC Taxi Zone in which the taximeter was disengaged |
| passenger_count | The number of passengers in the vehicle (this is a driver entered value) |
| trip_distance | The elapsed trip distance in miles reported by the taximeter |
| fare_amount | The time-and-distance fare calculated by the meter |
| extra | Miscellaneous extras and surcharges (this only includes the \$0.50 and \$1 rush hour and overnight charges) |
| mta_tax | \$0.50 MTA tax that is automatically triggered based on the metered rate in use |
| tip_amount | Tip amount (automatically populated for credit card tips; cash tips are not included) |
| tolls_amount | Total amount of all tolls paid in trip |
| improvement_surcharge | \$0.30 improvement surcharge assessed on hailed trips at the flag drop |
| total_amount | The total amount charged to passengers (does not include cash tips) |
| payment_type | A numeric code signifying how the passenger paid for the trip (1= Credit card; 2= Cash; 3= No charge; 4= Dispute; 5= Unknown; 6= Voided trip) |
| trip_type | A code indicating whether the trip was a street-hail or a dispatch that is automatically assigned based on the metered rate in use but can be altered by the driver (1= Street-hail; 2= Dispatch) |
| congestion_surcharge | Congestion surcharge for trips that start, end or pass through the congestion zone in Manhattan, south of 96th street (\$2.50 for non-shared trips in Yellow Taxis; \$2.75 for non-shared trips in Green Taxis) |

----

In [2]:
# importing the data

In [3]:
df_2017 = pd.read_csv("../data/taxi_trips/2017_taxi_trips.csv")

In [4]:
df_2018 = pd.read_csv("../data/taxi_trips/2018_taxi_trips.csv")

In [5]:
df_2019 = pd.read_csv("../data/taxi_trips/2019_taxi_trips.csv")

In [6]:
df_2020 = pd.read_csv("../data/taxi_trips/2020_taxi_trips.csv")

df_all_years = pd.concat([df_2017,df_2018,df_2019,df_2020], axis=0, ignore_index= True)

df_all_years
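
Reading files of this size with the default settings can make pandas infer column types chunk by chunk, which may produce mixed-dtype warnings. The sketch below shows one way to read a file in a single pass and parse the two timestamp columns up front. The `load_trips` helper and the inline sample are hypothetical; only the column names come from the data dictionary above.

```python
import io
import pandas as pd

# Hypothetical helper: parse the two timestamp columns while reading and
# let pandas infer each column's dtype from the whole file (low_memory=False).
def load_trips(path_or_buffer):
    return pd.read_csv(
        path_or_buffer,
        parse_dates=["lpep_pickup_datetime", "lpep_dropoff_datetime"],
        low_memory=False,
    )

# Tiny inline sample standing in for one of the yearly CSV files
sample = io.StringIO(
    "VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag\n"
    "1,2017-01-03 18:26:58,2017-01-03 18:32:31,N\n"
    "2,2017-01-06 09:57:47,2017-01-06 09:59:02,Y\n"
)
trips = load_trips(sample)
print(trips.dtypes)
```

With real files, `load_trips("../data/taxi_trips/2017_taxi_trips.csv")` would replace the inline sample. Reading in one pass uses more memory than the default chunked inference.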

----

[Index](#section0)

---
<a id="section1"></a>
##### Let’s stick to trips that were NOT sent via “store and forward”

In [7]:
def not_store_and_forw(i):
    mask = i.store_and_fwd_flag == "N"
    i = i[mask]
    return i

[Index](#section0)

---
<a id="section2"></a>
##### I’m only interested in street-hailed trips paid by card or cash, with a standard rate

In [8]:
def by_card_or_cash(i):
    # street-hailed trips
    i = i[i["trip_type"] == 1]
    # with a standard rate
    i = i[i["RatecodeID"] == 1]
    # paid by card or cash
    mask1 = i.payment_type == 1
    mask2 = i.payment_type == 2
    i = i[mask1 | mask2]
    return i
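
The three conditions above can also be combined into a single boolean mask, with `isin` covering the two accepted payment codes. This is a minimal sketch on made-up rows; the function name `street_hail_standard_card_or_cash` is hypothetical.

```python
import pandas as pd

def street_hail_standard_card_or_cash(trips):
    # Keep street-hailed trips (trip_type 1) at the standard rate (RatecodeID 1)
    # that were paid by credit card (1) or cash (2)
    keep = (
        (trips["trip_type"] == 1)
        & (trips["RatecodeID"] == 1)
        & trips["payment_type"].isin([1, 2])
    )
    return trips[keep]

# Made-up rows: only the first one satisfies all three conditions
sample = pd.DataFrame({
    "trip_type": [1, 2, 1, 1],
    "RatecodeID": [1, 1, 5, 1],
    "payment_type": [1, 2, 2, 3],
})
print(street_hail_standard_card_or_cash(sample))
```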

[Index](#section0)

---
<a id="section3"></a>
##### We can remove any trips with dates before 2017 or after 2020, along with any trips with pickups or drop-offs into unknown zones

In [9]:
def before_2017_after_2020(i):
    # trips with dates before 2017
    mask1 = i["lpep_pickup_datetime"] >= '2017-01-01 00:00:00.000' 
    # trips with dates after 2020
    mask2 = i["lpep_pickup_datetime"] < '2021-01-01 00:00:00.000'
    i = i[mask1 & mask2]
    return i
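
The function above handles the date range only; the second half of the rule, dropping trips that start or end in an unknown zone, could be sketched as follows. The IDs in `UNKNOWN_ZONE_IDS` are an assumption and should be checked against the Taxi Zones table before use; the function name is hypothetical.

```python
import pandas as pd

# Assumption: these location IDs mark unknown zones.
# Confirm them against the Taxi Zones table before relying on this list.
UNKNOWN_ZONE_IDS = [264, 265]

def drop_unknown_zones(trips, unknown_ids=UNKNOWN_ZONE_IDS):
    # A trip is dropped when either its pickup or its drop-off zone is unknown
    unknown = (
        trips["PULocationID"].isin(unknown_ids)
        | trips["DOLocationID"].isin(unknown_ids)
    )
    return trips[~unknown]

# Made-up rows: the second has an unknown pickup, the third an unknown drop-off
sample = pd.DataFrame({
    "PULocationID": [166, 264, 97],
    "DOLocationID": [42, 17, 265],
})
kept = drop_unknown_zones(sample)
print(kept)
```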

[Index](#section0)

---
<a id="section4"></a>
##### Let’s assume any trips with no recorded passengers had 1 passenger

In [10]:
def trips_no_recorded_passengers(i):
    # Work on a copy so the assignment does not write into a slice of the caller's frame
    i = i.copy()
    i["passenger_count"] = i["passenger_count"].replace(0, 1)
    return i

[Index](#section0)

---
<a id="section5"></a>
##### If a pickup date/time is AFTER the drop-off date/time, let’s swap them

In [11]:
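def missing_column(i):
    # Some yearly files lack this column; add it as a numeric 0 (not the string "0")
    # so the column keeps a numeric dtype when the years are combined
    if 'congestion_surcharge' not in i.columns:
        i["congestion_surcharge"] = 0
    return i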

In [12]:
def drop_off_date_time(i):
    # Rows whose pickup is later than the drop-off have the two timestamps reversed
    swapped = i["lpep_pickup_datetime"] > i["lpep_dropoff_datetime"]
    df_good = i[~swapped]
    # Exchange the two column labels on the reversed rows; concat then realigns by name
    df_to_swap = i[swapped].rename(columns={
        "lpep_pickup_datetime": "lpep_dropoff_datetime",
        "lpep_dropoff_datetime": "lpep_pickup_datetime",
    })
    i = pd.concat([df_good, df_to_swap], ignore_index=False)
    return i
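
The swap can also be done without relabeling columns, by picking each value from one column or the other with `Series.where`. This is a sketch on made-up rows, and `swap_reversed_timestamps` is a hypothetical name, not part of the notebook.

```python
import pandas as pd

def swap_reversed_timestamps(trips):
    pickup = trips["lpep_pickup_datetime"]
    dropoff = trips["lpep_dropoff_datetime"]
    reversed_rows = pickup > dropoff
    # where(cond, other) keeps the original value where cond is True
    # and takes the value from `other` everywhere else
    new_pickup = pickup.where(~reversed_rows, dropoff)
    new_dropoff = dropoff.where(~reversed_rows, pickup)
    return trips.assign(
        lpep_pickup_datetime=new_pickup,
        lpep_dropoff_datetime=new_dropoff,
    )

# Made-up rows: the first has its two timestamps reversed
sample = pd.DataFrame({
    "lpep_pickup_datetime": pd.to_datetime(
        ["2017-01-03 18:32:31", "2017-01-04 15:47:39"]),
    "lpep_dropoff_datetime": pd.to_datetime(
        ["2017-01-03 18:26:58", "2017-01-04 16:00:44"]),
})
fixed = swap_reversed_timestamps(sample)
print(fixed)
```

This version keeps the original row order and column order, and leaves the input frame unchanged.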

[Index](#section0)

---
<a id="section6"></a>
##### We can remove trips lasting longer than a day, and any trips which show both a distance and fare amount of 0

In [13]:
def dist_and_fare_amount_of_0(i):
    mask = (i["trip_distance"] != 0) | (i["fare_amount"] != 0)
    i= i[mask]
    return i

def remove_trips_longer_than_a_day(i):
    i["lpep_pickup_datetime"] = i["lpep_pickup_datetime"].astype('datetime64[ns]')
    i["lpep_dropoff_datetime"] = i["lpep_dropoff_datetime"].astype('datetime64[ns]')
    # Keep trips whose pickup and drop-off share the same day-of-month number.
    # Note: this is only a proxy for duration. It also drops short trips that cross
    # midnight, and keeps a trip that ends on the same day number of a later month.
    i["day_number_pickup"] = i["lpep_pickup_datetime"].dt.day
    i["day_number_dropoff"] = i["lpep_dropoff_datetime"].dt.day
    same_day = i["day_number_pickup"] == i["day_number_dropoff"]
    i = i[same_day]
    i = i.drop(labels = ["day_number_pickup", "day_number_dropoff"], axis = "columns")
    return i
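
Since the rule is stated in terms of trip duration, one alternative is to filter on the elapsed time directly rather than on day numbers. The sketch below assumes the two timestamp columns from the data dictionary; `drop_trips_over_one_day` is a hypothetical name, and the rows are made up to show the cases where the two approaches differ.

```python
import pandas as pd

def drop_trips_over_one_day(trips):
    # Elapsed time between pickup and drop-off as a Timedelta
    pickup = pd.to_datetime(trips["lpep_pickup_datetime"])
    dropoff = pd.to_datetime(trips["lpep_dropoff_datetime"])
    duration = dropoff - pickup
    # Keep trips that last at most 24 hours
    return trips[duration <= pd.Timedelta(days=1)]

sample = pd.DataFrame({
    "lpep_pickup_datetime": [
        "2017-01-03 23:50:00",   # 20-minute trip that crosses midnight
        "2017-01-03 10:00:00",   # two-day trip
        "2017-01-03 10:00:00",   # month-long trip ending on the same day number
    ],
    "lpep_dropoff_datetime": [
        "2017-01-04 00:10:00",
        "2017-01-05 11:00:00",
        "2017-02-03 10:30:00",
    ],
})
kept = drop_trips_over_one_day(sample)
print(kept)
```

Only the short trip across midnight is kept here. Switching the notebook to this filter would change the row counts shown in the outputs further down.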

[Index](#section0)

---
<a id="section7"></a>
##### If you notice any records where the fare, taxes, and surcharges are ALL negative, please make them positive

In [14]:
def looking_for_negative(i):
    negative2 = i.loc[(i["fare_amount"] < 0) & (i["mta_tax"] < 0) & (i["improvement_surcharge"] < 0)]
    return negative2

def convert_to_positiv(number):
    if number < 0:
        return number * -1
    else:
        return number

def changing_to_pos(i):
    # Work on a copy of the rows passed in, instead of the global `negat`
    i = i.copy()
    columns = ['fare_amount', 'extra', 'mta_tax','tip_amount', 'tolls_amount', 'improvement_surcharge', 'total_amount']
    for col in columns:
        i[col] = i[col].apply(convert_to_positiv)
    return i

def droping_index(i, index):
    i.drop(index=index, axis= 0, inplace=True)
    return i

def concatenation(i):
    df_2 = pd.concat([i, positive], axis=0)
    return df_2
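
The find, flip, drop and concatenate steps above can also be expressed as one in-place update on the affected rows. This is a sketch under the same reading of the rule (fare, MTA tax and improvement surcharge all negative); `flip_all_negative_charges` and `MONEY_COLUMNS` are hypothetical names and the rows are made up.

```python
import pandas as pd

MONEY_COLUMNS = ["fare_amount", "extra", "mta_tax", "tip_amount",
                 "tolls_amount", "improvement_surcharge", "total_amount"]

def flip_all_negative_charges(trips):
    trips = trips.copy()
    # Rows where fare, tax and surcharge are all negative
    all_negative = (
        (trips["fare_amount"] < 0)
        & (trips["mta_tax"] < 0)
        & (trips["improvement_surcharge"] < 0)
    )
    # Replace every money column on those rows with its absolute value
    trips.loc[all_negative, MONEY_COLUMNS] = trips.loc[all_negative, MONEY_COLUMNS].abs()
    return trips

# Made-up rows: the first is fully negative, the second only has a negative fare
sample = pd.DataFrame({
    "fare_amount": [-5.0, -3.0],
    "extra": [-0.5, 0.0],
    "mta_tax": [-0.5, 0.5],
    "tip_amount": [0.0, 0.0],
    "tolls_amount": [0.0, 0.0],
    "improvement_surcharge": [-0.3, 0.3],
    "total_amount": [-6.3, -2.2],
})
fixed = flip_all_negative_charges(sample)
print(fixed)
```

Because the rows are updated where they sit, the original row order is preserved and no module-level variables are needed.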


[Index](#section0)

---
<a id="section8"></a>
##### For any trips that have a fare amount but have a trip distance of 0, calculate the distance this way: (Fare amount - 2.5) / 2.5

In [15]:
def funcion_trip_distance(i):
    total = (i - 2.5) / 2.5
    return total 

[Index](#section0)

---
<a id="section9"></a>
##### For any trips that have a trip distance but have a fare amount of 0, calculate the fare amount this way: 2.5 + (trip distance x 2.5)

In [16]:
def funcion_fare_amount(i):
    total = 2.5 + (i * 2.5)
    return total 
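
The two formulas are inverses of each other: a fare of 10 with no distance gives (10 - 2.5) / 2.5 = 3 miles, and 3 miles with no fare gives 2.5 + (3 x 2.5) = 10. A sketch that applies both rules to a whole frame with `numpy.where` follows; the function and constant names are hypothetical and the rows are made up.

```python
import numpy as np
import pandas as pd

BASE_FARE = 2.5       # constant term in both formulas
RATE_PER_MILE = 2.5   # multiplier in both formulas

def fill_missing_distance_and_fare(trips):
    fare = trips["fare_amount"]
    dist = trips["trip_distance"]
    # Distance of 0: derive it from the fare; otherwise keep the recorded distance
    new_dist = np.where(dist == 0, (fare - BASE_FARE) / RATE_PER_MILE, dist)
    # Fare of 0: derive it from the distance; otherwise keep the recorded fare
    new_fare = np.where(fare == 0, BASE_FARE + dist * RATE_PER_MILE, fare)
    return trips.assign(trip_distance=new_dist, fare_amount=new_fare)

# Made-up rows: missing distance, missing fare, and a complete record
sample = pd.DataFrame({
    "fare_amount": [10.0, 0.0, 6.0],
    "trip_distance": [0.0, 1.7, 0.8],
})
filled = fill_missing_distance_and_fare(sample)
print(filled)
```

Note that a recorded fare below 2.5 with a distance of 0 yields a negative distance under the first formula, so such rows may need a separate check.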

----

In [17]:
def big_control(i):
    print("### Let’s stick to trips that were NOT sent via “store and forward” ###")
    i = not_store_and_forw(i)
    print(f"I should get just a N and its values \n{i.store_and_fwd_flag.value_counts()}")
    print("----------------------------------------------")
    print("### I’m only interested in street-hailed trips paid by card or cash, with a standard rate ###")
    i = by_card_or_cash(i)
    print(f"I should get just the number 1 and 2 (card or cash)are the values of them: \n{i.payment_type.value_counts()}")
    print("----------------------------------------------")
    print(f"I should get just the number 1 and a value: \n{i.trip_type.value_counts()}")
    print("----------------------------------------------")
    print("### We can remove any trips with dates before 2017 or after 2020, along with any trips with pickups or drop-offs into unknown zones ###")
    i = before_2017_after_2020(i)
    print(i["lpep_pickup_datetime"].min())
    print(i["lpep_pickup_datetime"].max())
    print("----------------------------------------------")
    print(f"I shouldn't 0-9: \n \n{i.passenger_count.value_counts()}")
    print("### Let’s assume any trips with no recorded passengers had 1 passenger ###")
    i = trips_no_recorded_passengers(i)
    print(f"I shouldn't get a 0: \n \n{i.passenger_count.value_counts()}")
    print("----------------------------------------------")
    print("### If a pickup date/time is AFTER the drop-off date/time, let’s swap them ###")
    print(i.shape[1])
    i = missing_column(i)
    print("----------------------------------------------")
    print(f"I should had 19 columns if I was missing a column now I should get a new one \n{i.shape[1]}")
    print("----------------------------------------------")
    i = drop_off_date_time(i)
    print("### We can remove trips lasting longer than a day, and any trips which show both a distance and fare amount of 0 ###")
    i = dist_and_fare_amount_of_0(i)
    print(i.shape)
    i = remove_trips_longer_than_a_day(i)
    print(f"I just removed trips longer than a day, \n{i.shape}")
    print("---------------FIN----------------------------")
    return i

In [18]:
def second_big_function(i):
    # Relies on the module-level `index` and `positive` created in the calling cell
    i = droping_index(i, index)
    i = concatenation(i)
    print(f"I just removed any trips which show both a distance and fare amount of 0: \n{i.shape}")
    # Split the rows into three groups: zero fare, zero distance, and complete rows
    zero_fare = i["fare_amount"] == 0
    zero_distance = i["trip_distance"] == 0
    # .copy() gives independent frames, so the assignments below do not write into a slice
    df_fare_amount = i[zero_fare].copy()
    df_trip_distance = i[zero_distance].copy()
    df_3 = i[~zero_fare & ~zero_distance]
    df_fare_amount["fare_amount"] = df_fare_amount["trip_distance"].apply(funcion_fare_amount)
    df_trip_distance["trip_distance"] = df_trip_distance["fare_amount"].apply(funcion_trip_distance)
    i = pd.concat([df_fare_amount, df_trip_distance, df_3], axis=0)
    print("---------------FIN-2--------------------------")
    return i

---

In [19]:
df = df_2017.copy()
df_2 = big_control(df)
negat = looking_for_negative(df_2)
positive = changing_to_pos(negat)
index= positive.index
df_17 = second_big_function(df_2)
df_17

### Let’s stick to trips that were NOT sent via “store and forward” ###
I should get just a N and its values 
N    11723594
Name: store_and_fwd_flag, dtype: int64
----------------------------------------------
### I’m only interested in street-hailed trips paid by card or cash, with a standard rate ###
I should get just the number 1 and 2 (card or cash)are the values of them: 
1    5797124
2    5586535
Name: payment_type, dtype: int64
----------------------------------------------
I should get just the number 1 and a value: 
1.0    11383659
Name: trip_type, dtype: int64
----------------------------------------------
### We can remove any trips with dates before 2017 or after 2020, along with any trips with pickups or drop-offs into unknown zones ###
2017-01-01 00:00:08.000
2018-03-26 16:00:22.000
----------------------------------------------
I shouldn't 0-9: 
 
1    9594517
2     899499
5     410667
3     209514
6     202167
4      67101
0        124
9          3
Name: passenger_count


I just removed any trips which show both a distance and fare amount of 0: 
(11213732, 19)



---------------FIN-2--------------------------


Unnamed: 0,VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge
490516,1,2017-01-03 18:26:58,2017-01-03 18:32:31,N,1,166,42,1,0.50,3.750,0.0,0.0,0.0,0.0,0.0,0.0,1,1.0,0
490519,1,2017-01-04 15:47:39,2017-01-04 16:00:44,N,1,97,17,1,1.70,6.750,0.0,0.0,0.0,0.0,0.0,0.0,1,1.0,0
490523,2,2017-01-06 09:57:47,2017-01-06 09:59:02,N,1,7,193,1,0.01,2.525,0.0,0.0,0.0,0.0,0.0,0.0,1,1.0,0
490524,1,2017-01-08 16:13:17,2017-01-08 16:49:35,N,1,25,54,1,4.80,14.500,0.0,0.0,0.0,0.0,0.0,0.0,1,1.0,0
490526,1,2017-01-09 09:56:04,2017-01-09 09:57:15,N,1,116,116,1,0.20,3.000,0.0,0.0,0.0,0.0,0.0,0.0,1,1.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11264206,2,2017-10-16 10:46:32,2017-10-16 10:49:18,N,1,41,74,2,0.37,4.000,0.0,0.5,0.0,0.0,0.3,4.8,2,1.0,0
11264210,2,2017-10-18 12:27:30,2017-10-18 12:28:04,N,1,75,75,2,0.05,2.500,0.0,0.5,0.0,0.0,0.3,3.3,2,1.0,0
11264243,2,2017-10-26 14:24:45,2017-10-26 14:24:58,N,1,75,75,4,0.08,2.500,0.0,0.5,0.0,0.0,0.3,3.3,2,1.0,0
11264254,2,2017-11-01 13:10:31,2017-11-01 13:17:12,N,1,65,33,2,0.80,6.000,0.0,0.5,0.0,0.0,0.3,6.8,2,1.0,0


In [20]:
df = df_2018.copy()
df_2 = big_control(df)
negat = looking_for_negative(df_2)
positive = changing_to_pos(negat)
index= positive.index
df_18 = second_big_function(df_2)
df_18

### Let’s stick to trips that were NOT sent via “store and forward” ###
I should get just a N and its values 
N    8790612
Name: store_and_fwd_flag, dtype: int64
----------------------------------------------
### I’m only interested in street-hailed trips paid by card or cash, with a standard rate ###
I should get just the number 1 and 2 (card or cash)are the values of them: 
1    4821805
2    3636122
Name: payment_type, dtype: int64
----------------------------------------------
I should get just the number 1 and a value: 
1.0    8457927
Name: trip_type, dtype: int64
----------------------------------------------
### We can remove any trips with dates before 2017 or after 2020, along with any trips with pickups or drop-offs into unknown zones ###
2017-12-31 01:39:20.000
2020-11-14 11:38:07.000
----------------------------------------------
I shouldn't 0-9: 
 
1    7155995
2     656167
5     294606
6     155789
3     135507
4      48780
0      10637
Name: passenger_count, dtype: int64



I just removed any trips which show both a distance and fare amount of 0: 
(8347246, 19)



---------------FIN-2--------------------------


Unnamed: 0,VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge
179348,1,2018-06-03 11:52:44,2018-06-03 11:53:18,N,1,74,74,1,3.30,10.75,0.0,0.5,0.0,0.0,0.3,0.8,2,1.0,0
528856,1,2018-05-02 20:25:43,2018-05-02 20:25:58,N,1,41,41,1,0.90,4.75,0.0,0.0,0.0,0.0,0.0,0.0,1,1.0,0
528857,1,2018-05-06 15:56:47,2018-05-06 16:07:12,N,1,226,112,1,2.50,8.75,0.0,0.0,0.0,0.0,0.0,0.0,1,1.0,0
528858,1,2018-05-09 16:10:15,2018-05-09 16:17:12,N,1,97,97,1,1.00,5.00,0.0,0.0,0.0,0.0,0.0,0.0,1,1.0,0
528859,1,2018-05-19 01:39:39,2018-05-19 01:45:08,N,1,49,62,1,1.30,5.75,0.0,0.0,0.0,0.0,0.0,0.0,1,1.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7934377,2,2018-09-26 10:57:38,2018-09-26 10:57:53,N,1,119,119,2,0.07,2.50,0.0,0.5,0.0,0.0,0.3,3.3,2,1.0,0
8531142,2,2018-12-21 09:06:14,2018-12-21 09:10:35,N,1,82,260,1,0.58,5.00,0.0,0.5,0.0,0.0,0.3,5.8,2,1.0,0
8531159,2,2018-12-24 14:14:45,2018-12-24 14:16:58,N,1,236,236,1,0.64,4.00,0.0,0.5,0.0,0.0,0.3,4.8,2,1.0,0
8626793,2,2018-12-26 21:18:41,2018-12-26 21:20:02,N,1,74,74,5,0.13,3.00,0.5,0.5,0.0,0.0,0.3,4.3,2,1.0,0


In [21]:
df = df_2019.copy()
df_2 = big_control(df)
negat = looking_for_negative(df_2)
positive = changing_to_pos(negat)
index= positive.index
df_19 = second_big_function(df_2)
df_19

### Let’s stick to trips that were NOT sent via “store and forward” ###
I should get just a N and its values 
N    5615484
Name: store_and_fwd_flag, dtype: int64
----------------------------------------------
### I’m only interested in street-hailed trips paid by card or cash, with a standard rate ###
I should get just the number 1 and 2 (card or cash)are the values of them: 
1.0    3003779
2.0    2292346
Name: payment_type, dtype: int64
----------------------------------------------
I should get just the number 1 and a value: 
1.0    5296125
Name: trip_type, dtype: int64
----------------------------------------------
### We can remove any trips with dates before 2017 or after 2020, along with any trips with pickups or drop-offs into unknown zones ###
2018-03-07 14:27:32.000
2020-04-22 20:19:55.000
----------------------------------------------
I shouldn't 0-9: 
 
1.0    4544881
2.0     394065
5.0     161333
6.0      87304
3.0      71812
4.0      26764
0.0       9816
Name: passenger_co


I just removed any trips which show both a distance and fare amount of 0: 
(5231418, 19)



---------------FIN-2--------------------------


Unnamed: 0,VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge
175772,1.0,2019-05-30 09:32:33,2019-05-30 10:39:12,N,1.0,80,107,1.0,5.00,15.000,0.0,0.5,0.0,0.0,0.3,0.8,1.0,1.0,0.0
607136,2.0,2019-04-18 20:55:23,2019-04-18 20:59:25,N,1.0,193,145,1.0,1.00,5.000,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
607146,2.0,2019-04-23 23:25:28,2019-04-23 23:27:52,N,1.0,7,7,1.0,0.01,2.525,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
607150,2.0,2019-04-26 12:26:28,2019-04-26 14:59:28,N,1.0,193,186,1.0,6.84,19.600,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
607151,2.0,2019-05-06 09:16:56,2019-05-06 09:39:39,N,1.0,74,136,1.0,5.74,16.850,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5547153,2.0,2019-01-16 13:23:22,2019-01-16 13:24:03,N,1.0,25,33,1.0,0.03,2.500,0.0,0.5,0.0,0.0,0.3,3.3,2.0,1.0,
5547154,2.0,2019-02-15 14:54:58,2019-02-15 14:55:12,N,1.0,166,166,1.0,0.08,2.500,0.0,0.5,0.0,0.0,0.3,3.3,2.0,1.0,0.0
5562704,2.0,2019-02-15 08:56:13,2019-02-15 08:57:44,N,1.0,145,145,2.0,0.21,3.000,0.0,0.5,0.0,0.0,0.3,3.8,2.0,1.0,0.0
5562737,2.0,2019-02-20 10:40:55,2019-02-20 10:45:02,N,1.0,145,145,2.0,0.78,5.000,0.0,0.5,0.0,0.0,0.3,5.8,2.0,1.0,0.0


In [22]:
df = df_2020.copy()
df_2 = big_control(df)
negat = looking_for_negative(df_2)
positive = changing_to_pos(negat)
index= positive.index
df_20 = second_big_function(df_2)
df_20

### Let’s stick to trips that were NOT sent via “store and forward” ###
I should get just a N and its values 
N    1201260
Name: store_and_fwd_flag, dtype: int64
----------------------------------------------
### I’m only interested in street-hailed trips paid by card or cash, with a standard rate ###
I should get just the number 1 and 2 (card or cash)are the values of them: 
1.0    643315
2.0    510395
Name: payment_type, dtype: int64
----------------------------------------------
I should get just the number 1 and a value: 
1.0    1153710
Name: trip_type, dtype: int64
----------------------------------------------
### We can remove any trips with dates before 2017 or after 2020, along with any trips with pickups or drop-offs into unknown zones ###
2019-12-18 15:52:30.000
2020-12-31 23:59:53.000
----------------------------------------------
I shouldn't 0-9: 
 
1.0    1005021
2.0      76151
5.0      29813
6.0      20326
3.0      14703
4.0       5066
0.0       2596
7.0          1
Name:


Unnamed: 0,VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge
316882,2.0,2020-01-03 19:11:01,2020-01-03 19:11:40,N,1.0,42,42,1.0,0.02,2.550,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,0.0
316883,2.0,2020-01-04 13:28:58,2020-01-04 13:51:04,N,1.0,41,170,1.0,4.67,14.175,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,0.0
316884,2.0,2020-01-04 22:29:52,2020-01-04 22:32:28,N,1.0,82,82,1.0,0.01,2.525,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,0.0
316885,2.0,2020-01-04 23:36:48,2020-01-04 23:48:12,N,1.0,181,45,1.0,2.87,9.675,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,0.0
316886,2.0,2020-01-05 19:34:47,2020-01-05 19:52:00,N,1.0,75,48,1.0,4.18,12.950,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1730034,2.0,2020-12-11 10:55:38,2020-12-11 11:05:31,N,1.0,244,220,1.0,4.06,13.500,0.0,0.5,0.0,2.8,0.3,17.1,2.0,1.0,0.0
1730035,2.0,2020-12-16 14:35:14,2020-12-16 14:48:01,N,1.0,244,220,1.0,3.27,13.000,0.0,0.5,0.0,2.8,0.3,16.6,2.0,1.0,0.0
1730036,2.0,2020-12-16 14:03:48,2020-12-16 14:13:28,N,1.0,244,220,1.0,3.87,13.000,0.0,0.5,0.0,2.8,0.3,16.6,2.0,1.0,0.0
1730037,2.0,2020-12-22 15:51:20,2020-12-22 16:00:48,N,1.0,244,220,1.0,3.50,13.000,0.0,0.5,0.0,2.8,0.3,16.6,2.0,1.0,0.0


In [23]:
df_17.to_csv(f'../data/df_17.csv', index=False)

In [24]:
df_18.to_csv(f'../data/df_18.csv', index=False)

In [25]:
df_19.to_csv(f'../data/df_19.csv', index=False)

In [26]:
df_20.to_csv(f'../data/df_20.csv', index=False)