# Data Manipulation - Filters

## Learnings:

- rename columns in a DataFrame
- manipulate columns in a DataFrame (select, reorder, delete)
- filter a DataFrame
- assign to a column based on a condition

In [4]:
import pandas as pd

data = pd.read_csv('vehicles.csv')
data.head(2)

Unnamed: 0,Make,Model,Year,Engine Displacement,Cylinders,Transmission,Drivetrain,Vehicle Class,Fuel Type,Fuel Barrels/Year,City MPG,Highway MPG,Combined MPG,CO2 Emission Grams/Mile,Fuel Cost/Year
0,AM General,DJ Po Vehicle 2WD,1984,2.5,4.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,19.388824,18,17,17,522.764706,1950
1,AM General,FJ8c Post Office,1984,4.2,6.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,13,683.615385,2550


In [5]:
data.shape

(35952, 15)

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35952 entries, 0 to 35951
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Make                     35952 non-null  object 
 1   Model                    35952 non-null  object 
 2   Year                     35952 non-null  int64  
 3   Engine Displacement      35952 non-null  float64
 4   Cylinders                35952 non-null  float64
 5   Transmission             35952 non-null  object 
 6   Drivetrain               35952 non-null  object 
 7   Vehicle Class            35952 non-null  object 
 8   Fuel Type                35952 non-null  object 
 9   Fuel Barrels/Year        35952 non-null  float64
 10  City MPG                 35952 non-null  int64  
 11  Highway MPG              35952 non-null  int64  
 12  Combined MPG             35952 non-null  int64  
 13  CO2 Emission Grams/Mile  35952 non-null  float64
 14  Fuel Cost/Year        

## Checking the dataframe column names

Two ways to rename columns:
- `data.columns` is an **attribute** of the DataFrame that holds a list-like object with the column names
    - You can replace it with another list containing the names you want
    - Note that you have to replace the whole set of column names at once

- `data.rename()` is a **method** of a DataFrame with which you can rename one or more columns at a time
    - You just need to pass a dictionary of the form {'old_name':'new_name'}
    - By default, it changes the labels of the **index** (`axis=0`); specify `axis=1` (or use the `columns=` keyword) to change **column** names
    - By default it returns a new DataFrame; with `inplace=True` it modifies the existing DataFrame and returns `None`
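
A minimal sketch of both approaches on a tiny, made-up dataframe. The name `toy` and its values are assumptions for illustration only; they are not part of the vehicles dataset.

```python
import pandas as pd

# Hypothetical two-column frame, used only for illustration
toy = pd.DataFrame({'Make': ['Ford', 'Acura'], 'Year': [1999, 2001]})

# Option 1: replace the whole set of names through the .columns attribute
toy.columns = ['make', 'year']

# Option 2: rename only some columns with a {'old_name': 'new_name'} dictionary.
# axis=1 targets column names instead of index labels.
renamed = toy.rename({'make': 'manufacturer'}, axis=1)

print(list(toy.columns))      # ['make', 'year']
print(list(renamed.columns))  # ['manufacturer', 'year']
```

Note that `toy` keeps its names after the `rename` call, because the result was stored in a separate variable.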

In [None]:
data.columns

### Substituting `.columns` attribute

In [None]:
# say for example we want to convert all columns to lowercase!

In [None]:
data.columns = ['make', 'Model', 'Year', 'Engine Displacement', 'Cylinders',
               'Transmission', 'Drivetrain', 'Vehicle Class', 'Fuel Type',
               'Fuel Barrels/Year', 'City MPG', 'Highway MPG', 'Combined MPG',
               'CO2 Emission Grams/Mile', 'xxxxxxx']

In [None]:
data.head()

In [None]:
data.columns = ['Make', 'Model', 'Year', 'Engine Displacement', 'Cylinders',
               'Transmission', 'Drivetrain', 'Vehicle Class', 'Fuel Type',
               'Fuel Barrels/Year', 'City MPG', 'Highway MPG', 'Combined MPG',
               'CO2 Emission Grams/Mile', 'Fuel Cost/Year']

In [None]:
data.columns

In [None]:
colnames = []
for col in data.columns:
    colnames.append(col.lower())

In [None]:
data.columns = [col.lower().replace(' ','_').replace('/','_') for col in data.columns]

In [None]:
data.head()

In [None]:
# Raises ValueError (length mismatch): you must provide one name per column
data.columns = ['manufacturer']

### `.rename()` method

`.rename({'old_column':'new_column'})`

#### returning a new dataframe

In [None]:
data.rename({'make': 'manufacturer'}, axis=1)

In [None]:
data.rename(columns={'make': 'manufacturer', 'year':'model_year'})

In [None]:
data = data.rename(columns={'make': 'manufacturer', 'year':'model_year'})

In [None]:
data.head()

#### inplace

In [None]:
data.rename({'engine_displacement': 'engine_displacement2',
             'vehicle_class': 'vehicle_class2'}, axis=1, inplace=True)

In [None]:
# dataframe already changed
data.head()

If you assign the result of a call that uses `inplace=True` to a variable, check what happens:

In [None]:
data.rename({'year3': 'year10'}, axis=1)

In [None]:
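# inplace=True modifies `data` and returns None, so y is None.
# The columns were renamed to 'manufacturer' and 'model_year' above; the later cells expect 'make' and 'year3'.
y = data.rename({'model_year': 'year3', 'manufacturer': 'make'}, axis=1, inplace=True)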

In [None]:
data.head()

In [None]:
print(y)

Two options:
> 1. Store the result back in the variable `data`:

    data = data.rename(columns={'Make':'Manufacturer', 'Year':'ANO'})
> 2. Use the argument `inplace=True` to modify the dataframe directly (the call then returns `None`):

    data.rename(columns={'Make':'Manufacturer', 'Year':'ANO'}, inplace=True)
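
The pitfall can be reproduced on a small scale. This is a sketch on made-up data (the `toy` frame is an assumption for illustration): the return value of an `inplace=True` call is `None`, while the dataframe itself is changed.

```python
import pandas as pd

toy = pd.DataFrame({'Make': ['Ford'], 'Year': [1999]})  # made-up data

# inplace=True changes `toy` itself and returns None
result = toy.rename(columns={'Year': 'ANO'}, inplace=True)

print(result)             # None
print(list(toy.columns))  # ['Make', 'ANO']
```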
    

In [None]:
# You can also assign to a different variable, of course
renamed_data = data.rename(columns={'make':'Manufacturer', 'year3':'ANO'})

In [None]:
renamed_data.head(2)

In [None]:
data.head(2)

## Reordering columns in a dataframe

>    - Remember to pass a list of column names when you select several columns from a dataframe

Just select the columns in a different order and overwrite the previous dataframe.

In [None]:
data[['make','model']]

In [None]:
data[['model', 'make']]

In [None]:
data.columns

In [None]:
data = data[['fuel_cost_year', 'make', 'model', 'year3', 'engine_displacement2', 'cylinders',
       'transmission', 'drivetrain', 'vehicle_class2', 'fuel_type',
       'fuel_barrels_year', 'city_mpg', 'highway_mpg', 'combined_mpg',
       'co2_emission_grams_mile']]

In [None]:
data

In [None]:
# WRONG: raises an error (too many indexers) because two separate strings were passed instead of a list.
# Use data.loc[:, ['model', 'make']] instead.
data.loc[:, 'model','make']

How can I take the `fuel_cost_year` column and put it at the beginning of the dataframe?

In [None]:
data.columns

In [None]:
column_order = ['co2_emission_grams_mile', 'fuel_cost_year', 'make', 'model', 'year3',
                'engine_displacement2', 'cylinders', 'transmission', 'drivetrain',
                'vehicle_class2', 'fuel_type', 'fuel_barrels_year', 'city_mpg',
                'highway_mpg', 'combined_mpg']

data = data.loc[:, column_order]
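
Typing every column name by hand gets tedious on wide dataframes. One possible shortcut, sketched here on a made-up frame (`toy` and `first` are hypothetical names), is to build the new order from the existing `.columns`:

```python
import pandas as pd

# Made-up data, used only for illustration
toy = pd.DataFrame({'make': ['Ford'], 'model': ['Ka'], 'fuel_cost_year': [1950]})

first = 'fuel_cost_year'
# Put the chosen column first, then keep every other column in its current order
column_order = [first] + [col for col in toy.columns if col != first]
toy = toy.loc[:, column_order]

print(list(toy.columns))  # ['fuel_cost_year', 'make', 'model']
```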

In [None]:
# problems you may run into

# overwriting `data` with a subset of itself: `data` is now a single column (a Series)
data = data['make']

In [None]:
data.head(2)

In [None]:
data = pd.read_csv('data/vehicles.csv')

In [None]:
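# assigning the result of an inplace=True call: rename() returns None, so `data` becomes None
data = data.rename({'Year':'Model_Year'}, axis=1, inplace=True)

In [None]:
# AttributeError: `data` is now None, so it has no .head() method
data.head(2)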

In [None]:
print(data)

## Remove column (or row)

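- The `.drop()` method removes rows or columns and returns a new dataframe.
- By default, `.drop()` drops a row given its index label (`axis=0`); pass `axis=1` to drop a column by name.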
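
A short sketch of the row and column variants on a made-up frame (`toy` is an assumption for illustration). The `columns=` keyword is an equivalent way to ask for a column drop.

```python
import pandas as pd

toy = pd.DataFrame({'Make': ['Ford', 'Acura', 'Audi'], 'Year': [1999, 2001, 2003]})

no_year = toy.drop('Year', axis=1)     # drop a column by name
no_year_kw = toy.drop(columns='Year')  # same result using the columns= keyword
no_row_1 = toy.drop(1)                 # drop the row whose index label is 1

print(list(no_year.columns))  # ['Make']
print(list(no_row_1.index))   # [0, 2]
print(toy.shape)              # (3, 2): the original frame is unchanged
```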

In [None]:
data = pd.read_csv('data/vehicles.csv')

In [None]:
# KeyError: by default drop() looks for the label 'Year' in the row index (axis=0)
data.drop('Year')

In [None]:
data.drop('Year', axis=1)

In [None]:
data.drop(1)

In [None]:
data.drop(1).reset_index(drop=True)

# Filter records
>    - `mask` concept
>    - `.query()` method

This is really important for data wrangling.

In [None]:
data = pd.read_csv('data/vehicles.csv')

In [None]:
data.head(2)

## Simple Example: Starting with a numpy array. How can I filter the values of an array?

In [None]:
import numpy as np

In [None]:
my_array = np.array([1,2,3,4,5,6,7,8,9,10])

In [None]:
my_array

In [None]:
my_array * 10

In [None]:
my_array > 5

The result of `my_array > 5` is called a **mask**: a boolean array holding the `True` or `False` outcome of the comparison for each element.

In [None]:
my_array[5:]

In [None]:
my_array[ [False, False, False, False, False,  True,  True,  True,  True, True] ]

In [None]:
my_array[my_array < 8]

Masks can be used as an index to select data!

In [None]:
my_array[ [False, False, False, False, False,  True,  True,  True,  True, True] ]

In [None]:
my_array[ my_array > 5 ]

After selecting, you can do anything with the selection, for example assign to it. This is called a **vectorized** operation: it is applied to all selected elements at once, without an explicit loop.
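
To make "all at once" concrete, here is a sketch that compares the mask assignment with an equivalent explicit loop. The names `values`, `vectorized` and `looped` are hypothetical and used only for this comparison.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Vectorized: one statement updates every element where the mask is True
vectorized = values.copy()
vectorized[vectorized > 5] = 1000

# Equivalent explicit loop, shown only for comparison
looped = values.copy()
for i in range(len(looped)):
    if looped[i] > 5:
        looped[i] = 1000

print(vectorized.tolist())  # [1, 2, 3, 4, 5, 1000, 1000, 1000, 1000, 1000]
```

Both versions produce the same array; the mask version is shorter and avoids the Python-level loop.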

In [None]:
my_array[my_array > 5] = 1000

In [None]:
my_array

In [None]:
my_matrix = np.random.randint(0, 10, size=(5,5))
my_matrix

In [None]:
my_matrix > 5

In [None]:
my_matrix[ my_matrix > 5 ] = -99999

In [None]:
my_matrix

In [None]:
my_array[ my_array > 5 ] = 10

In [None]:
my_array

You can also save the condition in a variable.

In [None]:
my_array = np.array([1,2,3,4,5,6,7,8,9,10])

In [None]:
my_array

In [None]:
condition = my_array > 5 
condition

In [None]:
my_array

In [None]:
my_array[ condition ]

## Bitwise logical operators - Combining conditions

To combine more than one condition, you can use
- `&` - analogous to `and`
- `|` - analogous to `or` 

For example, get all numbers from `my_array` that are greater than 3 and smaller than 8.

Let's do it in steps:
- get values greater than 3

In [None]:
my_array[my_array > 3]

- get values smaller than 8

In [None]:
my_array[my_array < 8]

- get values greater than 3 and smaller than 8

In [None]:
greater_than_3 = my_array > 3

In [None]:
smaller_than_8 = my_array < 8

In [None]:
(my_array > 3) & (my_array < 8)

In [None]:
# (my_array > 3) or (my_array < 8) raises a ValueError on arrays; use `|` instead
(my_array > 3) | (my_array < 8)


In [None]:
(my_array > 3) & (my_array < 8)

In [None]:
greater_than_3 & smaller_than_8
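
The reason for using `&` rather than the keyword `and` can be checked directly. This is a small sketch with a hypothetical array named `values`:

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# `&` combines the two masks element by element
both = (values > 3) & (values < 8)
print(values[both].tolist())  # [4, 5, 6, 7]

# Python's `and` needs a single True/False, which a multi-element array cannot provide
try:
    (values > 3) and (values < 8)
except ValueError as err:
    print('ValueError:', err)
```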

## Now in a dataframe

Let's find the rows in which the Cylinders values are exactly 4.

In [None]:
data

In [None]:
data['Cylinders'] == 4

In [None]:
data.loc[:, 'Cylinders']

In [None]:
data.loc[data['Cylinders'] == 4, :]

### Example

In [None]:
# create a column with all zeroes named - 'fl_city_car'

data['fl_city_car'] = 0

In [None]:
(data['City MPG']) > (data['Highway MPG'])

In [None]:
# assign 1 to 'fl_city_car' for all cars that have 'City MPG' > 'Highway MPG'

data.loc[(data['City MPG']) > (data['Highway MPG']), 'fl_city_car'] = 1

In [None]:
data.loc[(data['City MPG']) > (data['Highway MPG']), 'fl_city_car'] = 10

In [None]:
data.loc[(data['City MPG']) > (data['Highway MPG']), 'fl_city_car']
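
As an alternative to the two-step approach above (not the one used in this notebook), `numpy.where` can build the flag column in a single step. The sketch below uses a made-up frame named `toy`, and the column name `fl_city_car_where` is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up values, used only for illustration
toy = pd.DataFrame({'City MPG': [18, 13, 30], 'Highway MPG': [17, 13, 35]})

# Approach used above: start at 0, then set 1 where the mask is True
toy['fl_city_car'] = 0
toy.loc[toy['City MPG'] > toy['Highway MPG'], 'fl_city_car'] = 1

# Alternative: np.where(condition, value_if_true, value_if_false)
toy['fl_city_car_where'] = np.where(toy['City MPG'] > toy['Highway MPG'], 1, 0)

print(toy['fl_city_car'].tolist())        # [1, 0, 0]
print(toy['fl_city_car_where'].tolist())  # [1, 0, 0]
```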

## You can combine conditions

Cars made by `Ford` with 6 `Cylinders`

In [None]:
data.loc[:, :]

In [None]:
data['std'] = data[['City MPG','Highway MPG','Combined MPG']].std(axis=1)

In [None]:
data.loc[data['std'] != 0, :]

In [None]:
data.loc[(data['Cylinders'] == 6) & (data['Make'] == 'Ford'), :]

In [None]:
# careful with: `&` binds tighter than `==`, so each comparison needs its own parentheses

data.loc[data['Make']=='Ford' & data['Cylinders']==6, :] # WRONG!! raises an error
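
A sketch of why the parentheses matter, on a made-up frame named `toy`. The exact exception type is not asserted here because it may depend on the pandas version.

```python
import pandas as pd

# Made-up values, used only for illustration
toy = pd.DataFrame({'Make': ['Ford', 'Ford', 'Acura'], 'Cylinders': [6.0, 4.0, 6.0]})

# Without parentheses, Python evaluates 'Ford' & toy['Cylinders'] first
# (`&` binds tighter than `==`), which is not a valid operation.
try:
    toy.loc[toy['Make'] == 'Ford' & toy['Cylinders'] == 6, :]
except Exception as err:
    print(type(err).__name__)

# With parentheses, each comparison yields a mask before `&` combines them
ford_6 = toy.loc[(toy['Make'] == 'Ford') & (toy['Cylinders'] == 6), :]
print(ford_6.index.tolist())  # [0]
```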

## You can put the conditions in variables as well

In [None]:
condition1 = (data['Make']=='Ford')
condition2 = (data['Cylinders']==6)
condition3 = (data['Combined MPG'] < 18)

In [None]:
data.loc[condition1 & condition2 & condition3, :]

## Another way to do the same thing.

* using the method `query`

The `query` method receives a string in which you state your condition. Important points:
- `.query()` is a method of your dataframe
- `.query()` receives a string
- Every word inside the string that is not quoted is treated as a column of your dataframe. For example, `.query('Year == 1999')` looks for the column `Year`. If you run `.query('Make == Ford')`, it looks for both a column named `Make` and a column named `Ford`. To match the values of the column `Make` against the **string** Ford, you have to run `.query('Make == "Ford"')`.
- If your column name has spaces, you have to wrap it in backticks, as in **.query('\`Engine Displacement\` < 4')**.

In [None]:
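# copy the index labels into a list and put the string 'Make' in front
indexes = list(data.index)
indexes.insert(0, 'Make')

In [None]:
# remove the label 0, so the list has one label per row again
indexes.remove(0)

In [None]:
data.index = indexes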

In [None]:
data.query('Make == "Ford"')

In [None]:
data.query('Cylinders == 4 and Make == "Ford"')

In [None]:
data.query('`City MPG` > `Highway MPG`')

In [None]:
data.query('Cylinders == 4')

In [None]:
numero_cilindros = 6
data.query(f'Make == "Acura" and Cylinders == {numero_cilindros}')

In [None]:
numero_cilindros = 4
data.query(f'Make == "Acura" and Cylinders == {numero_cilindros}')
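
The rules listed for `.query()` can be tried on a small, made-up frame (`toy` is an assumption for illustration). The sketch also shows the `@name` syntax, which reads a Python variable from the surrounding scope and can replace the f-string used above.

```python
import pandas as pd

# Made-up values, used only for illustration
toy = pd.DataFrame({
    'Make': ['Ford', 'Acura', 'Acura'],
    'Cylinders': [6.0, 4.0, 6.0],
    'City MPG': [18, 25, 20],
    'Highway MPG': [24, 22, 27],
})

# Quoted text is a string value; unquoted words are column names
fords = toy.query('Make == "Ford"')

# Backticks reference a column whose name contains spaces
city_cars = toy.query('`City MPG` > `Highway MPG`')

# @name reads a Python variable instead of a column
numero_cilindros = 6
acura_6 = toy.query('Make == "Acura" and Cylinders == @numero_cilindros')

print(fords.index.tolist())      # [0]
print(city_cars.index.tolist())  # [1]
print(acura_6.index.tolist())    # [2]
```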