## Performing exploratory data analysis in Python

Before you can clean your data, you have to know what your data looks like. The process of understanding your data is called exploratory data analysis (EDA).

In [1]:
#Import the necessary libraries
import pandas as pd

In [2]:
#read the csv file
df=pd.read_csv("scooter.csv")
#With the data in a DataFrame, you can now explore it, and then analyze it.

#### Exploring the data

In [3]:
#look at the columns and the data types 
df.columns

Index(['month', 'trip_id', 'region_id', 'vehicle_id', 'started_at', 'ended_at',
       'DURATION', 'start_location_name', 'end_location_name', 'user_id',
       'trip_ledger_id'],
      dtype='object')

In [4]:
df.dtypes

month                  object
trip_id                 int64
region_id               int64
vehicle_id              int64
started_at             object
ended_at               object
DURATION               object
start_location_name    object
end_location_name      object
user_id                 int64
trip_ledger_id          int64
dtype: object

In [5]:
#Print the first five rows of the DataFrame
df.head()

Unnamed: 0,month,trip_id,region_id,vehicle_id,started_at,ended_at,DURATION,start_location_name,end_location_name,user_id,trip_ledger_id
0,May,1613335,202,9424537,5/21/2019 18:33,5/21/2019 18:40,0:07:03,"1901 Roma Ave NE, Albuquerque, NM 87106, USA","1899 Roma Ave NE, Albuquerque, NM 87106, USA",8417864,1488546
1,May,1613639,202,9424537,5/21/2019 19:07,5/21/2019 19:12,0:04:57,"1 Domenici Center en Domenici Center, Albuquer...","1111 Stanford Dr NE, Albuquerque, NM 87106, USA",8417864,1488838
2,May,1613708,202,9424537,5/21/2019 19:13,5/21/2019 19:15,0:01:14,"1 Domenici Center en Domenici Center, Albuquer...","1 Domenici Center en Domenici Center, Albuquer...",8417864,1488851
3,May,1613867,202,9424537,5/21/2019 19:29,5/21/2019 19:36,0:06:58,"Rotunda at Science & Technology Park, 801 Univ...","725 University Blvd SE, Albuquerque, NM 87106,...",8417864,1489064
4,May,1636714,202,8926493,5/24/2019 13:38,5/24/2019 13:41,0:03:06,"401 2nd St NW, Albuquerque, NM 87102, USA","401 2nd St NW, Albuquerque, NM 87102, USA",35436274,1511212


In [7]:
#To display all the columns, you can change the maximum number of columns shown using the set_option function:
pd.set_option('display.max_columns', 500)


In [8]:
#Displaying a single column
df["DURATION"]

0        0:07:03
1        0:04:57
2        0:01:14
3        0:06:58
4        0:03:06
          ...   
34221    0:14:00
34222    0:08:00
34223    1:53:00
34224    0:12:00
34225    1:51:00
Name: DURATION, Length: 34226, dtype: object

In [9]:
#display a list of columns using double [], as shown in the following code block:
df[['trip_id','DURATION','start_location_name']] 

Unnamed: 0,trip_id,DURATION,start_location_name
0,1613335,0:07:03,"1901 Roma Ave NE, Albuquerque, NM 87106, USA"
1,1613639,0:04:57,"1 Domenici Center en Domenici Center, Albuquer..."
2,1613708,0:01:14,"1 Domenici Center en Domenici Center, Albuquer..."
3,1613867,0:06:58,"Rotunda at Science & Technology Park, 801 Univ..."
4,1636714,0:03:06,"401 2nd St NW, Albuquerque, NM 87102, USA"
...,...,...,...
34221,2482235,0:14:00,"Central @ Broadway, Albuquerque, NM 87102, USA"
34222,2482254,0:08:00,"224 Central Ave SW, Albuquerque, NM 87102, USA"
34223,2482257,1:53:00,"105 Stanford Dr SE, Albuquerque, NM 87106, USA"
34224,2482275,0:12:00,"100 Broadway Blvd SE, Albuquerque, NM 87102, USA"


You can also pull a sample from your data using sample(). The sample method allows you to specify how many rows you would like to pull. The results are shown in the following code block:

In [10]:
df.sample(5)

Unnamed: 0,month,trip_id,region_id,vehicle_id,started_at,ended_at,DURATION,start_location_name,end_location_name,user_id,trip_ledger_id
3666,May,1726286,202,6443170,5/31/2019 2:54,5/31/2019 3:04,0:10:15,"506 3rd St SW, Albuquerque, NM 87102, USA","123 Central Ave NW, Albuquerque, NM 87102, USA",35613651,1597729
20158,June,2035248,202,2584557,6/23/2019 2:17,6/23/2019 2:36,0:18:17,"3645 Central Ave NW, Albuquerque, NM 87104, USA","2700 Central Ave SW, Albuquerque, NM 87104, USA",42259943,1899420
33694,July,2465769,202,8703340,7/21/2019 1:46,7/21/2019 2:00,0:14:00,"622 Central Ave NW, Albuquerque, NM 87102, USA","901 Park Ave SW, Albuquerque, NM 87102, USA",38318605,2323913
17057,June,1962210,202,6557397,6/17/2019 2:06,6/17/2019 2:52,0:46:28,"2410 Central Ave SE, Albuquerque, NM 87106, USA","2300 Central Ave SE, Albuquerque, NM 87106, USA",41732290,1827635
18861,June,2004672,202,6243714,6/21/2019 1:42,6/21/2019 2:07,0:25:12,"2525 Tingley Dr SW, Albuquerque, NM 87104, USA","Central @ New York, Albuquerque, NM 87104, USA",42232004,1869485
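
Because sample() draws rows at random, each run returns different rows. The following is a minimal sketch, using a small made-up DataFrame rather than scooter.csv, of how the random_state parameter makes the draw repeatable and how frac draws a fraction of the rows instead of a fixed count:

```python
import pandas as pd

# Hypothetical stand-in for the scooter data
toy = pd.DataFrame({"trip_id": range(100, 110),
                    "month": ["May"] * 5 + ["June"] * 5})

# The same random_state always yields the same rows
first = toy.sample(3, random_state=42)
second = toy.sample(3, random_state=42)
print(first.equals(second))  # True

# frac draws a fraction of the rows instead of a fixed count
half = toy.sample(frac=0.5, random_state=0)
print(len(half))  # 5
```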


#### Slicing Data

Slicing takes the format of [start:end], where a blank is the first or last row depending on which position is blank. To slice the first 10 rows, you can use the following notation: 

In [11]:
df[:10] 

Unnamed: 0,month,trip_id,region_id,vehicle_id,started_at,ended_at,DURATION,start_location_name,end_location_name,user_id,trip_ledger_id
0,May,1613335,202,9424537,5/21/2019 18:33,5/21/2019 18:40,0:07:03,"1901 Roma Ave NE, Albuquerque, NM 87106, USA","1899 Roma Ave NE, Albuquerque, NM 87106, USA",8417864,1488546
1,May,1613639,202,9424537,5/21/2019 19:07,5/21/2019 19:12,0:04:57,"1 Domenici Center en Domenici Center, Albuquer...","1111 Stanford Dr NE, Albuquerque, NM 87106, USA",8417864,1488838
2,May,1613708,202,9424537,5/21/2019 19:13,5/21/2019 19:15,0:01:14,"1 Domenici Center en Domenici Center, Albuquer...","1 Domenici Center en Domenici Center, Albuquer...",8417864,1488851
3,May,1613867,202,9424537,5/21/2019 19:29,5/21/2019 19:36,0:06:58,"Rotunda at Science & Technology Park, 801 Univ...","725 University Blvd SE, Albuquerque, NM 87106,...",8417864,1489064
4,May,1636714,202,8926493,5/24/2019 13:38,5/24/2019 13:41,0:03:06,"401 2nd St NW, Albuquerque, NM 87102, USA","401 2nd St NW, Albuquerque, NM 87102, USA",35436274,1511212
5,May,1636780,202,3902020,5/24/2019 13:52,5/24/2019 14:13,0:21:27,"520 Central Ave SW, Albuquerque, NM 87102, USA","3217 Pershing Ave SE, Albuquerque, NM 87106, USA",34352757,1511371
6,May,1636856,202,5192526,5/24/2019 14:04,5/24/2019 14:34,0:30:43,"330 Tijeras Ave NW, Albuquerque, NM 87102, USA","809 Copper Ave NW, Albuquerque, NM 87102, USA",35466666,1511483
7,May,1636912,202,3902020,5/24/2019 14:14,5/24/2019 14:15,0:01:05,"3217 Pershing Ave SE, Albuquerque, NM 87106, USA","802 Wellesley Dr SE, Albuquerque, NM 87106, USA",34352757,1511390
8,May,1637035,202,5192526,5/24/2019 14:37,5/24/2019 14:43,0:05:33,"809 Copper Ave NW, Albuquerque, NM 87102, USA","809 Copper Ave NW, Albuquerque, NM 87102, USA",35466666,1511516
9,May,1637036,202,9885684,5/24/2019 14:38,5/24/2019 15:09,0:31:54,"2901 Central Ave NE, Albuquerque, NM 87106, USA","2000 Zearing Ave NW, Albuquerque, NM 87104, USA",34352757,1511666


In [12]:
#Likewise, to grab the rows from 10 to the end (34,225), you can use the following notation
df[10:]

Unnamed: 0,month,trip_id,region_id,vehicle_id,started_at,ended_at,DURATION,start_location_name,end_location_name,user_id,trip_ledger_id
10,May,1637075,202,5192526,5/24/2019 14:45,5/24/2019 14:48,0:03:34,"809 Copper Ave NW, Albuquerque, NM 87102, USA","809 Copper Ave NW, Albuquerque, NM 87102, USA",35466666,1511544
11,May,1637120,202,5192526,5/24/2019 14:52,5/24/2019 14:58,0:05:40,"809 Copper Ave NW, Albuquerque, NM 87102, USA","809 Copper Ave NW, Albuquerque, NM 87102, USA",35466666,1511610
12,May,1637212,202,3203954,5/24/2019 15:05,5/24/2019 15:21,0:16:37,"423 Central Ave NE, Albuquerque, NM 87102, USA","512 Central Ave SE, Albuquerque, NM 87102, USA",35490582,1511765
13,May,1637299,202,3027855,5/24/2019 15:16,5/24/2019 15:36,0:20:12,"1898 Mountain Rd NW, Albuquerque, NM 87104, USA","824 Stover Ave SW, Albuquerque, NM 87102, USA",34352757,1511876
14,May,1637378,202,4335950,5/24/2019 15:27,5/24/2019 15:31,0:03:51,"400 Marquette Ave NW, Albuquerque, NM 87102, USA","5th @ Marquette, Albuquerque, NM 87102, USA",35493806,1511834
...,...,...,...,...,...,...,...,...,...,...,...
34221,July,2482235,202,2893981,7/21/2019 23:51,7/22/2019 0:05,0:14:00,"Central @ Broadway, Albuquerque, NM 87102, USA","1418 4th St NW, Albuquerque, NM 87102, USA",42559731,2340035
34222,July,2482254,202,8201542,7/21/2019 23:52,7/22/2019 0:00,0:08:00,"224 Central Ave SW, Albuquerque, NM 87102, USA","302 San Felipe St NW, Albuquerque, NM 87104, USA",42457674,2339885
34223,July,2482257,202,5136810,7/21/2019 23:52,7/22/2019 1:45,1:53:00,"105 Stanford Dr SE, Albuquerque, NM 87106, USA","3339 Central Ave NE, Albuquerque, NM 87106, USA",42576631,2342126
34224,July,2482275,202,3125962,7/21/2019 23:53,7/22/2019 0:05,0:12:00,"100 Broadway Blvd SE, Albuquerque, NM 87102, USA","1413 4th St SW, Albuquerque, NM 87102, USA",42575656,2340036


In [13]:
#You can also slice the frame starting at index 3 (the fourth row) and ending before index 9, as shown in the following code block
df[3:9]

Unnamed: 0,month,trip_id,region_id,vehicle_id,started_at,ended_at,DURATION,start_location_name,end_location_name,user_id,trip_ledger_id
3,May,1613867,202,9424537,5/21/2019 19:29,5/21/2019 19:36,0:06:58,"Rotunda at Science & Technology Park, 801 Univ...","725 University Blvd SE, Albuquerque, NM 87106,...",8417864,1489064
4,May,1636714,202,8926493,5/24/2019 13:38,5/24/2019 13:41,0:03:06,"401 2nd St NW, Albuquerque, NM 87102, USA","401 2nd St NW, Albuquerque, NM 87102, USA",35436274,1511212
5,May,1636780,202,3902020,5/24/2019 13:52,5/24/2019 14:13,0:21:27,"520 Central Ave SW, Albuquerque, NM 87102, USA","3217 Pershing Ave SE, Albuquerque, NM 87106, USA",34352757,1511371
6,May,1636856,202,5192526,5/24/2019 14:04,5/24/2019 14:34,0:30:43,"330 Tijeras Ave NW, Albuquerque, NM 87102, USA","809 Copper Ave NW, Albuquerque, NM 87102, USA",35466666,1511483
7,May,1636912,202,3902020,5/24/2019 14:14,5/24/2019 14:15,0:01:05,"3217 Pershing Ave SE, Albuquerque, NM 87106, USA","802 Wellesley Dr SE, Albuquerque, NM 87106, USA",34352757,1511390
8,May,1637035,202,5192526,5/24/2019 14:37,5/24/2019 14:43,0:05:33,"809 Copper Ave NW, Albuquerque, NM 87102, USA","809 Copper Ave NW, Albuquerque, NM 87102, USA",35466666,1511516
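
The slice notation above selects by position, and the end of the range is excluded. A small sketch with an invented DataFrame shows that the plain slice matches iloc, whereas loc slices by label and includes the end label:

```python
import pandas as pd

toy = pd.DataFrame({"trip_id": range(1000, 1012)})  # default index 0..11

by_slice = toy[3:9]      # positions 3 to 8
by_iloc = toy.iloc[3:9]  # same rows, explicit positional indexer
by_loc = toy.loc[3:9]    # label-based: includes label 9

print(len(by_slice), len(by_iloc), len(by_loc))  # 6 6 7
```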


Sometimes, you know the exact row you want, and instead of slicing it, you can select it using loc[]. The loc indexer takes the index label, which, in this example, is an integer.

In [14]:
df.loc[34221]

month                                                            July
trip_id                                                       2482235
region_id                                                         202
vehicle_id                                                    2893981
started_at                                            7/21/2019 23:51
ended_at                                               7/22/2019 0:05
DURATION                                                      0:14:00
start_location_name    Central @ Broadway, Albuquerque, NM 87102, USA
end_location_name          1418 4th St NW, Albuquerque, NM 87102, USA
user_id                                                      42559731
trip_ledger_id                                                2340035
Name: 34221, dtype: object

Using at[] with a row label and a column name, you can select a single value. With the default integer index, the label matches the position used in the slicing examples. For example, this returns the duration of the trip with index label 2 (the third row):

In [15]:
df.at[2,'DURATION']

'0:01:14'
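
With the default integer index, labels and positions coincide, which can hide the difference between them. The following sketch uses a made-up frame indexed by trip ID (an assumption for illustration, not how scooter.csv is loaded) to contrast the label-based loc and at with the position-based iat:

```python
import pandas as pd

toy = pd.DataFrame(
    {"DURATION": ["0:07:03", "0:04:57", "0:01:14"]},
    index=[1613335, 1613639, 1613708],  # trip IDs used as index labels
)

by_label = toy.at[1613708, "DURATION"]  # single value by row label
by_position = toy.iat[2, 0]             # third row, first column
row = toy.loc[1613639]                  # whole row by label, as a Series

print(by_label, by_position, row["DURATION"])
```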

In [16]:
#Using the where method, you can pass a condition, as shown in the following code block
user=df.where(df['user_id']==8417864)
user

Unnamed: 0,month,trip_id,region_id,vehicle_id,started_at,ended_at,DURATION,start_location_name,end_location_name,user_id,trip_ledger_id
0,May,1613335.0,202.0,9424537.0,5/21/2019 18:33,5/21/2019 18:40,0:07:03,"1901 Roma Ave NE, Albuquerque, NM 87106, USA","1899 Roma Ave NE, Albuquerque, NM 87106, USA",8417864.0,1488546.0
1,May,1613639.0,202.0,9424537.0,5/21/2019 19:07,5/21/2019 19:12,0:04:57,"1 Domenici Center en Domenici Center, Albuquer...","1111 Stanford Dr NE, Albuquerque, NM 87106, USA",8417864.0,1488838.0
2,May,1613708.0,202.0,9424537.0,5/21/2019 19:13,5/21/2019 19:15,0:01:14,"1 Domenici Center en Domenici Center, Albuquer...","1 Domenici Center en Domenici Center, Albuquer...",8417864.0,1488851.0
3,May,1613867.0,202.0,9424537.0,5/21/2019 19:29,5/21/2019 19:36,0:06:58,"Rotunda at Science & Technology Park, 801 Univ...","725 University Blvd SE, Albuquerque, NM 87106,...",8417864.0,1489064.0
4,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...
34221,,,,,,,,,,,
34222,,,,,,,,,,,
34223,,,,,,,,,,,
34224,,,,,,,,,,,


The preceding code and results show where() with the condition that the user
ID equals 8417864. Values in rows that do not meet the condition are replaced with
NaN, which is also why the integer columns are now displayed as floats.

In [17]:
#This line of code would not include NaN
df[(df['user_id']==8417864)]

Unnamed: 0,month,trip_id,region_id,vehicle_id,started_at,ended_at,DURATION,start_location_name,end_location_name,user_id,trip_ledger_id
0,May,1613335,202,9424537,5/21/2019 18:33,5/21/2019 18:40,0:07:03,"1901 Roma Ave NE, Albuquerque, NM 87106, USA","1899 Roma Ave NE, Albuquerque, NM 87106, USA",8417864,1488546
1,May,1613639,202,9424537,5/21/2019 19:07,5/21/2019 19:12,0:04:57,"1 Domenici Center en Domenici Center, Albuquer...","1111 Stanford Dr NE, Albuquerque, NM 87106, USA",8417864,1488838
2,May,1613708,202,9424537,5/21/2019 19:13,5/21/2019 19:15,0:01:14,"1 Domenici Center en Domenici Center, Albuquer...","1 Domenici Center en Domenici Center, Albuquer...",8417864,1488851
3,May,1613867,202,9424537,5/21/2019 19:29,5/21/2019 19:36,0:06:58,"Rotunda at Science & Technology Park, 801 Univ...","725 University Blvd SE, Albuquerque, NM 87106,...",8417864,1489064
1256,May,1681624,202,8192169,5/27/2019 1:11,5/27/2019 1:11,0:00:12,"2513 Comanche Rd NE, Albuquerque, NM 87107, USA","2513 Comanche Rd NE, Albuquerque, NM 87107, USA",8417864,1554217


In [18]:
#Using both notations, you can combine conditional statements. By using the same user ID condition, you can add a trip ledger ID condition
one=df['user_id']==8417864
two=df['trip_ledger_id']==1488838
df.where(one & two)

Unnamed: 0,month,trip_id,region_id,vehicle_id,started_at,ended_at,DURATION,start_location_name,end_location_name,user_id,trip_ledger_id
0,,,,,,,,,,,
1,May,1613639.0,202.0,9424537.0,5/21/2019 19:07,5/21/2019 19:12,0:04:57,"1 Domenici Center en Domenici Center, Albuquer...","1111 Stanford Dr NE, Albuquerque, NM 87106, USA",8417864.0,1488838.0
2,,,,,,,,,,,
3,,,,,,,,,,,
4,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...
34221,,,,,,,,,,,
34222,,,,,,,,,,,
34223,,,,,,,,,,,
34224,,,,,,,,,,,


In [19]:
#Using the second notation, the output is as follows
df[(one)&(two)]    

Unnamed: 0,month,trip_id,region_id,vehicle_id,started_at,ended_at,DURATION,start_location_name,end_location_name,user_id,trip_ledger_id
1,May,1613639,202,9424537,5/21/2019 19:07,5/21/2019 19:12,0:04:57,"1 Domenici Center en Domenici Center, Albuquer...","1111 Stanford Dr NE, Albuquerque, NM 87106, USA",8417864,1488838
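
The difference between the two notations can be checked on a small example. The sketch below uses an invented frame; it shows that where() keeps the original shape and upcasts integer columns to float so they can hold NaN, while boolean indexing returns only the matching rows with their dtypes intact:

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": [8417864, 8417864, 35436274],
    "trip_ledger_id": [1488546, 1488838, 1511212],
})

one = toy["user_id"] == 8417864
two = toy["trip_ledger_id"] == 1488838

masked = toy.where(one & two)  # same shape, non-matching rows become NaN
filtered = toy[one & two]      # only the matching row

print(masked.shape, filtered.shape)  # (3, 2) (1, 2)
print(masked["user_id"].dtype, filtered["user_id"].dtype)  # float64 int64

# Dropping the all-NaN rows leaves the same row labels as the filter
same_rows = masked.dropna(how="all").index.equals(filtered.index)
print(same_rows)  # True
```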


### Analyzing the data

By using the describe method, you can see a series of statistics pertaining to your
data. In statistics, there is a set of statistics referred to as the five-number summary,
and describe() is a variant of that:

In [20]:
df.describe()

Unnamed: 0,trip_id,region_id,vehicle_id,user_id,trip_ledger_id
count,34226.0,34226.0,34226.0,34226.0,34226.0
mean,2004438.0,202.0,5589507.0,38754200.0,1869549.0
std,230047.6,0.0,2627164.0,4275441.0,225263.9
min,1613335.0,202.0,1034847.0,10802.0,1488546.0
25%,1813521.0,202.0,3260435.0,36657100.0,1683023.0
50%,1962520.0,202.0,5617097.0,38807500.0,1827796.0
75%,2182324.0,202.0,8012871.0,42227740.0,2042524.0
max,2482335.0,202.0,9984848.0,42587320.0,2342161.0


In [21]:
#Using describe() on a single column is sometimes more helpful
df['start_location_name'].describe()

count                                               34220
unique                                               2972
top       1898 Mountain Rd NW, Albuquerque, NM 87104, USA
freq                                                 1210
Name: start_location_name, dtype: object

The data is not numeric, so we get a different set of statistics, but these provide some
insight. Of the 34,220 non-null starting locations, there are 2,972 unique locations. The
top location (1898 Mountain Rd NW) accounts for 1,210 trip starting locations. Later,
you will geocode this data (add coordinates to the addresses), and knowing the unique
values means you only have to geocode those 2,972 and not the full 34,220.
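
As a sketch of why the unique count matters, the snippet below (with invented addresses) collects each distinct address once, so a later lookup would only need to run once per address. The lookup itself is not shown.

```python
import pandas as pd

toy = pd.DataFrame({"start_location_name": [
    "1898 Mountain Rd NW", "401 2nd St NW", "1898 Mountain Rd NW", None,
]})

col = toy["start_location_name"]
print(col.nunique())  # 2, missing values are not counted

# unique() would keep the missing value, so drop it first
addresses = col.dropna().unique()
print(list(addresses))
```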

In [22]:
#The value_counts method will give you the value and count for all unique values.
#Here it is called on a single column
df['DURATION'].value_counts()

0:04:00     825
0:03:00     807
0:05:00     728
0:06:00     649
0:07:00     627
           ... 
0:40:15       1
39:24:42      1
0:43:09       1
1:05:10       1
1:12:07       1
Name: DURATION, Length: 4135, dtype: int64

From this method, you can see that 0:04:00 is at the top with a frequency of 825,
which you could have found out with describe(), but you can also see the frequency
of all the other values. To see each frequency as a proportion of the total, you can pass
the normalize parameter (which is False by default):

In [23]:
df['DURATION'].value_counts(normalize=True)

0:04:00     0.025847
0:03:00     0.025284
0:05:00     0.022808
0:06:00     0.020333
0:07:00     0.019644
              ...   
0:40:15     0.000031
39:24:42    0.000031
0:43:09     0.000031
1:05:10     0.000031
1:12:07     0.000031
Name: DURATION, Length: 4135, dtype: float64

You will notice that no single value makes up a significant share of the durations.
You can also pass the dropna parameter. By default, value_counts() sets it to True,
so NaN values are not counted. Setting it to False, you can see that end_location_name
is missing 2070 entries:

In [24]:
df['end_location_name'].value_counts(dropna=False)

NaN                                                  2070
1898 Mountain Rd NW, Albuquerque, NM 87104, USA       802
Central @ Tingley, Albuquerque, NM 87104, USA         622
330 Tijeras Ave NW, Albuquerque, NM 87102, USA        529
2550 Central Ave NE, Albuquerque, NM 87106, USA       478
                                                     ... 
116 Washington St SE, Albuquerque, NM 87108, USA        1
716 Commercial St SE, Albuquerque, NM 87102, USA        1
119 Walter St SE, Albuquerque, NM 87102, USA            1
3801 Carlisle Blvd NE, Albuquerque, NM 87107, USA       1
1413 4th St SW, Albuquerque, NM 87102, USA              1
Name: end_location_name, Length: 4264, dtype: int64
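
The effect of normalize and dropna is easier to see on a handful of rows. A minimal sketch with made-up durations:

```python
import pandas as pd

durations = pd.Series(["0:04:00", "0:04:00", "0:03:00", None])

counts = durations.value_counts()                # missing values excluded
shares = durations.value_counts(normalize=True)  # proportions of non-null values
with_nan = durations.value_counts(dropna=False)  # missing values get their own row

print(counts["0:04:00"])            # 2
print(round(shares["0:04:00"], 3))  # 0.667
print(with_nan.sum())               # 4
```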

In [25]:
#The best way to find out how many missing values you have in your columns is to use the isnull() method
df.isnull().sum()

month                     0
trip_id                   0
region_id                 0
vehicle_id                0
started_at                0
ended_at                  0
DURATION               2308
start_location_name       6
end_location_name      2070
user_id                   0
trip_ledger_id            0
dtype: int64
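
Counts are easier to judge relative to the number of rows. As a sketch on invented data, isnull().mean() gives the fraction of missing values per column, because True counts as 1 and False as 0:

```python
import pandas as pd

toy = pd.DataFrame({
    "DURATION": ["0:07:03", None, None, "0:03:06"],
    "user_id": [1, 2, 3, 4],
})

missing_count = toy.isnull().sum()
missing_share = toy.isnull().mean()

print(missing_count["DURATION"])  # 2
print(missing_share["DURATION"])  # 0.5
```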

Another parameter of value_counts() is bins. The scooter dataset does not have
a good column for this, but using a numeric column, you would get results like the
following:

In [26]:
df['trip_id'].value_counts(bins=10)

(1787135.0, 1874035.0]      5561
(1700235.0, 1787135.0]      4900
(1874035.0, 1960935.0]      4316
(1960935.0, 2047835.0]      3922
(2047835.0, 2134735.0]      3296
(2221635.0, 2308535.0]      2876
(2308535.0, 2395435.0]      2515
(2134735.0, 2221635.0]      2490
(2395435.0, 2482335.0]      2228
(1612465.999, 1700235.0]    2122
Name: trip_id, dtype: int64
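
Binning is more natural on a measured quantity than on an ID. The sketch below uses hypothetical trip lengths in minutes (not a column in scooter.csv) and passes sort=False so the bins stay in ascending order instead of being sorted by count:

```python
import pandas as pd

minutes = pd.Series([1, 2, 3, 10, 20, 30])  # invented trip lengths

# Three equal-width bins between the minimum and maximum value
binned = minutes.value_counts(bins=3, sort=False)
print(binned.tolist())  # [4, 1, 1]
```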

## Handling common data issues using pandas 

#### Drop rows and columns

In [27]:
#You can drop columns using the drop method. The method will allow you to specify
#whether to drop a row or a column. Rows are the default, so we will specify columns,
#as shown in the following code block
df.drop(columns=['region_id'], inplace=True)
#inplace=True makes drop() modify the original DataFrame instead of returning a new one.

In [28]:
#To drop a row, you only need to specify index instead of columns. To drop the row with the index of 34225
df.drop(index=[34225],inplace=True)
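
Using inplace=True is not the only option. As a sketch on an invented frame, drop() without inplace returns a new DataFrame and leaves the original untouched, which can be easier to reason about when re-running cells:

```python
import pandas as pd

toy = pd.DataFrame({"trip_id": [1, 2, 3], "region_id": [202, 202, 202]})

trimmed = toy.drop(columns=["region_id"])  # new DataFrame, toy is unchanged
shorter = trimmed.drop(index=[2])          # drop the row labelled 2

print(list(toy.columns))      # ['trip_id', 'region_id']
print(list(trimmed.columns))  # ['trip_id']
print(list(shorter.index))    # [0, 1]
```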

In [29]:
#Looking at the e-scooter data, there are six rows with no start location name:
df['start_location_name'][(df['start_location_name'].isnull())]

26042    NaN
26044    NaN
26046    NaN
26048    NaN
26051    NaN
26053    NaN
Name: start_location_name, dtype: object

To drop these rows, you can use dropna() with axis=0 and how='any', which are the
defaults. This will, however, also delete rows where other nulls exist, such as in
end_location_name. So, you will need to specify the column name as a subset:

In [30]:
df.dropna(subset=['start_location_name'],inplace=True)

In [31]:
#Then, when you select nulls in the start_location_name field as in the preceding code block, you will get an empty series
df['start_location_name'][(df['start_location_name'].isnull())]

Series([], Name: start_location_name, dtype: object)
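
The effect of subset can be checked on a few rows. In this sketch (invented data), a plain dropna() removes every row containing any null, while subset limits the check to one column:

```python
import pandas as pd

toy = pd.DataFrame({
    "start_location_name": ["A St", None, "C St"],
    "end_location_name": [None, "B St", "C St"],
})

any_null = toy.dropna()                                  # axis=0, how='any'
start_only = toy.dropna(subset=["start_location_name"])  # check one column only

print(len(any_null))    # 1
print(len(start_only))  # 2
```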

Dropping an entire column based on missing values may only make sense if a certain
percentage of rows are null. For example, if more than 25% of the rows are null, you
may want to drop it. The thresh parameter of dropna() is the minimum number of
non-null values required to keep a column, so dropping columns that are more than
25% null means requiring at least 75% non-null values:

In [32]:
thresh=int(len(df)*.75)
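
In dropna(), thresh is the minimum number of non-null values a row or column needs in order to be kept. The following sketch on invented data assumes the goal is to drop columns that are more than 25% null:

```python
import pandas as pd

toy = pd.DataFrame({
    "mostly_full": [1, 2, 3, None],    # 25% null
    "half_empty": [1, None, None, 4],  # 50% null
})

# Keep a column only if at least 75% of its values are present
min_non_null = int(len(toy) * 0.75)
kept = toy.dropna(axis="columns", thresh=min_non_null)

print(list(kept.columns))  # ['mostly_full']
```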

In [33]:
#Instead of dropping nulls, you may want to fill them with a value. You can use fillna() for this; it returns a new DataFrame unless inplace=True
df.fillna(value='00:00:00',axis='columns')

Unnamed: 0,month,trip_id,vehicle_id,started_at,ended_at,DURATION,start_location_name,end_location_name,user_id,trip_ledger_id
0,May,1613335,9424537,5/21/2019 18:33,5/21/2019 18:40,0:07:03,"1901 Roma Ave NE, Albuquerque, NM 87106, USA","1899 Roma Ave NE, Albuquerque, NM 87106, USA",8417864,1488546
1,May,1613639,9424537,5/21/2019 19:07,5/21/2019 19:12,0:04:57,"1 Domenici Center en Domenici Center, Albuquer...","1111 Stanford Dr NE, Albuquerque, NM 87106, USA",8417864,1488838
2,May,1613708,9424537,5/21/2019 19:13,5/21/2019 19:15,0:01:14,"1 Domenici Center en Domenici Center, Albuquer...","1 Domenici Center en Domenici Center, Albuquer...",8417864,1488851
3,May,1613867,9424537,5/21/2019 19:29,5/21/2019 19:36,0:06:58,"Rotunda at Science & Technology Park, 801 Univ...","725 University Blvd SE, Albuquerque, NM 87106,...",8417864,1489064
4,May,1636714,8926493,5/24/2019 13:38,5/24/2019 13:41,0:03:06,"401 2nd St NW, Albuquerque, NM 87102, USA","401 2nd St NW, Albuquerque, NM 87102, USA",35436274,1511212
...,...,...,...,...,...,...,...,...,...,...
34220,July,2482224,7247079,7/21/2019 23:51,7/22/2019 1:46,1:55:00,"2500 Central Ave SE, Albuquerque, NM 87106, USA","3418 Central Ave SE, Albuquerque, NM 87106, USA",42587320,2342157
34221,July,2482235,2893981,7/21/2019 23:51,7/22/2019 0:05,0:14:00,"Central @ Broadway, Albuquerque, NM 87102, USA","1418 4th St NW, Albuquerque, NM 87102, USA",42559731,2340035
34222,July,2482254,8201542,7/21/2019 23:52,7/22/2019 0:00,0:08:00,"224 Central Ave SW, Albuquerque, NM 87102, USA","302 San Felipe St NW, Albuquerque, NM 87104, USA",42457674,2339885
34223,July,2482257,5136810,7/21/2019 23:52,7/22/2019 1:45,1:53:00,"105 Stanford Dr SE, Albuquerque, NM 87106, USA","3339 Central Ave NE, Albuquerque, NM 87106, USA",42576631,2342126


In the following code, we will copy the rows where both the start and end location
are null. Then, we will create a dictionary that assigns a street name to the
start_location_name field and a different street address to the end_location_name field.
Using fillna(), we pass the dictionary to the value parameter, and then display those two
columns to show the change. Because the rows with a null start_location_name were
dropped earlier, the selection is empty here:

In [40]:
startstop=df[(df['start_location_name'].isnull())&(df['end_location_name'].isnull())]
value={'start_location_name':'Start St.','end_location_name':'Stop St.'}
#fillna() returns a new DataFrame, so assign the result back
startstop=startstop.fillna(value=value)
startstop[['start_location_name','end_location_name']]

Unnamed: 0,start_location_name,end_location_name
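
Since the selection above is empty, here is a sketch of the same dictionary-based fillna() on a small invented frame that does contain nulls. Each key in the dictionary names a column and each value is the fill for that column:

```python
import pandas as pd

toy = pd.DataFrame({
    "start_location_name": [None, "401 2nd St NW"],
    "end_location_name": [None, None],
})

value = {"start_location_name": "Start St.", "end_location_name": "Stop St."}
filled = toy.fillna(value=value)  # returns a new DataFrame, toy is unchanged

print(filled.loc[0, "start_location_name"])  # Start St.
print(filled.loc[1, "end_location_name"])    # Stop St.
```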


You can drop rows based on more advanced filters; for example, what if you want to
drop all the rows where the month is May? You could iterate through the DataFrame
and check the month, and then drop the row if it is May. A much better way is to filter
out the rows, and then pass their index to the drop method. You can filter the DataFrame
and assign the result to a new one, as shown in the following code block:

In [41]:
may=df[(df['month']=='May')]
may

Unnamed: 0,month,trip_id,vehicle_id,started_at,ended_at,DURATION,start_location_name,end_location_name,user_id,trip_ledger_id
0,May,1613335,9424537,5/21/2019 18:33,5/21/2019 18:40,0:07:03,"1901 Roma Ave NE, Albuquerque, NM 87106, USA","1899 Roma Ave NE, Albuquerque, NM 87106, USA",8417864,1488546
1,May,1613639,9424537,5/21/2019 19:07,5/21/2019 19:12,0:04:57,"1 Domenici Center en Domenici Center, Albuquer...","1111 Stanford Dr NE, Albuquerque, NM 87106, USA",8417864,1488838
2,May,1613708,9424537,5/21/2019 19:13,5/21/2019 19:15,0:01:14,"1 Domenici Center en Domenici Center, Albuquer...","1 Domenici Center en Domenici Center, Albuquer...",8417864,1488851
3,May,1613867,9424537,5/21/2019 19:29,5/21/2019 19:36,0:06:58,"Rotunda at Science & Technology Park, 801 Univ...","725 University Blvd SE, Albuquerque, NM 87106,...",8417864,1489064
4,May,1636714,8926493,5/24/2019 13:38,5/24/2019 13:41,0:03:06,"401 2nd St NW, Albuquerque, NM 87102, USA","401 2nd St NW, Albuquerque, NM 87102, USA",35436274,1511212
...,...,...,...,...,...,...,...,...,...,...
4220,May,1737356,9974212,5/31/2019 23:58,6/1/2019 0:07,0:09:15,"415 Central Ave SW, Albuquerque, NM 87102, USA","415 Central Ave SW, Albuquerque, NM 87102, USA",35714580,1608429
4221,May,1737376,6563620,5/31/2019 23:59,5/31/2019 23:59,0:00:23,"524 Central Ave SW, Albuquerque, NM 87102, USA","600 Central Ave NW, Albuquerque, NM 87102, USA",37503537,1608261
4222,May,1737386,2741467,5/31/2019 23:59,6/1/2019 0:01,0:02:27,"601 Gold Ave SW, Albuquerque, NM 87102, USA","Gold Bldg, 320 Gold Ave SW, Albuquerque, NM 87...",37485128,1608314
4223,May,1737391,6563620,5/31/2019 23:59,6/1/2019 0:02,0:03:15,"524 Central Ave SW, Albuquerque, NM 87102, USA","524 Central Ave SW, Albuquerque, NM 87102, USA",37504521,1608337


In [42]:
#Then you can use drop() on the original DataFrame and pass the index for the rows in the may DataFrame
df.drop(index=may.index,inplace=True)

In [43]:
#Now, if you look at the months in the original DataFrame, you will see that May is missing
df['month'].value_counts()

June    20259
July     9735
Name: month, dtype: int64
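
An alternative to dropping by index is to keep the rows you want by inverting the condition. A sketch on invented data, in which both approaches give the same result:

```python
import pandas as pd

toy = pd.DataFrame({"month": ["May", "June", "July", "May"],
                    "trip_id": [1, 2, 3, 4]})

may = toy[toy["month"] == "May"]
dropped = toy.drop(index=may.index)  # filter, then drop by index
kept = toy[toy["month"] != "May"]    # keep everything that is not May

print(dropped.equals(kept))  # True
```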

Now that you have removed the rows and columns that you either do not need, or that
were unusable on account of missing data, it is time to format the data that remains.

## Creating and Modifying columns

In [1]:
#Import the necessary libraries
import pandas as pd

In [2]:
#read the csv file
df=pd.read_csv("scooter.csv")
#With the data in a DataFrame, you can now explore it, and then analyze it.

The following code will make all the column names lowercase:

In [4]:
df.columns=[x.lower() for x in df.columns] 
print(df.columns)

Index(['month', 'trip_id', 'region_id', 'vehicle_id', 'started_at', 'ended_at',
       'duration', 'start_location_name', 'end_location_name', 'user_id',
       'trip_ledger_id'],
      dtype='object')


The preceding code says that for every item in df.columns,
make it lowercase, and assign the result back to df.columns. You can also use capitalize(),
which uppercases only the first character, title() for title case, or upper(), as shown:

In [6]:
df.columns=[x.upper() for x in df.columns]
print(df.columns)

Index(['MONTH', 'TRIP_ID', 'REGION_ID', 'VEHICLE_ID', 'STARTED_AT', 'ENDED_AT',
       'DURATION', 'START_LOCATION_NAME', 'END_LOCATION_NAME', 'USER_ID',
       'TRIP_LEDGER_ID'],
      dtype='object')
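
The string methods differ in ways that matter for column names containing underscores. A short sketch on two sample names, followed by the vectorized columns.str accessor applied to a made-up empty frame:

```python
import pandas as pd

names = ["start_location_name", "DURATION"]

lowered = [x.lower() for x in names]
capped = [x.capitalize() for x in names]  # first character only
titled = [x.title() for x in names]       # first letter of every word

print(lowered)  # ['start_location_name', 'duration']
print(capped)   # ['Start_location_name', 'Duration']
print(titled)   # ['Start_Location_Name', 'Duration']

# The same change without a list comprehension
toy = pd.DataFrame(columns=["Trip_ID", "DURATION"])
toy.columns = toy.columns.str.lower()
print(list(toy.columns))  # ['trip_id', 'duration']
```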


In [7]:
#You could also make the DURATION field lowercase using the rename method, as shown:
df.rename(columns={'DURATION':'duration'},inplace=True)

You can pass a dictionary that remaps multiple column names. For example, you can
shorten region_id to region using rename. In the following code snippet, we change
the DURATION column to lowercase and rename region_id to region. Keys that do not
match an existing column name are ignored by default, so they must match the current casing:

In [8]:
df.rename(columns={'DURATION':'duration','region_id':'region'},inplace=True)
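
rename() ignores dictionary keys that do not match an existing column, so a casing mismatch fails silently. The following sketch on an invented frame shows this, along with the errors='raise' parameter of rename(), which turns such mismatches into a KeyError:

```python
import pandas as pd

toy = pd.DataFrame({"DURATION": ["0:07:03"], "REGION_ID": [202]})

# 'region_id' does not match 'REGION_ID', so only DURATION is renamed
renamed = toy.rename(columns={"DURATION": "duration", "region_id": "region"})
print(list(renamed.columns))  # ['duration', 'REGION_ID']

# With errors='raise', the unmatched key is reported instead of ignored
try:
    toy.rename(columns={"region_id": "region"}, errors="raise")
    raised = False
except KeyError:
    raised = True
print(raised)  # True
```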