<a href="https://colab.research.google.com/github/tavi1402/Data_Science_bootcamp/blob/main/OOPS_with_python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classes and Object Oriented Programming in Python

This tutorial is a part of the [Zero to Data Science Bootcamp by Jovian](https://www.jovian.ai/data-analyst-bootcamp)

![](https://i.imgur.com/yBsPHnF.png)

Object-oriented programming (OOP) is a method of structuring programs into _objects_ that encapsulate _data_ and _functionality_. For example, NumPy arrays and Pandas data frames are objects that contain data and offer methods to retrieve, manipulate and perform operations on the data stored within them.

Python is an object-oriented language: everything in Python is an object, and every object is an _instance_ of a class. Classes are blueprints for creating objects. In this tutorial, we'll explore how to create new classes and objects in Python.

This tutorial covers the following topics:

- Defining classes and creating objects
- Class constructor, properties and methods
- Implementing "dunder" methods for easier usage
- Getters, setters, static methods & class methods
- Inheritance, overriding and abstract methods

### How to Run the Code

The best way to learn the material is to execute the code and experiment with it yourself. This tutorial is an executable [Jupyter notebook](https://jupyter.org). You can _run_ this tutorial and experiment with the code examples in a couple of ways: *using free online resources* (recommended) or *on your computer*.

#### Option 1: Running using free online resources (1-click, recommended)

The easiest way to start executing the code is to click the **Run** button at the top of this page and select **Run on Binder**. You can also select "Run on Colab" or "Run on Kaggle", but you'll need to create an account on [Google Colab](https://colab.research.google.com) or [Kaggle](https://kaggle.com) to use these platforms.


#### Option 2: Running on your computer locally

To run the code on your computer locally, you'll need to set up [Python](https://www.python.org), download the notebook and install the required libraries. We recommend using the [Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/) distribution of Python. Click the **Run** button at the top of this page, select the **Run Locally** option, and follow the instructions.





## Problem Statement - Implementing Pandas Data Frames from Scratch

To understand classes, we'll attempt to implement Pandas data frames from scratch in Python.

![](https://i.imgur.com/zfxLzEv.png)

Here's some of the functionality we'll try to replicate.

## Defining classes and creating objects

A class is a blueprint for creating an object. Classes are defined using the `class` keyword. The _body_ of a class is an indented block of code that defines its functionality. Here's the simplest way of defining a class:
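
A minimal sketch of such a definition, using the class name `DataFrame` that the rest of this tutorial refers to:

```python
class DataFrame:
    pass  # placeholder: a class body needs at least one statement
```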

Note that the body contains just one statement, `pass`, which does nothing, i.e., the class has no functionality yet.

We can now create an object of the class by invoking the class like a function.

We just created an object of the class `DataFrame`. However, we have no way to access the object. We can fix this by assigning the object to a variable.

The variable `df1` holds a reference to the object, and can be used to retrieve the object.

When we invoke `DataFrame()` again, it creates a new object.

You can tell that the objects are different because they are at different addresses in the RAM (the address is the last portion of the output).

Note that we can have multiple variables pointing to the same object, simply by reassigning variables.

`df2` and `df3` point to the same object, but `df1` points to a different object. More precisely, `df2` and `df3` point to the same _location in memory_, while `df1` points to a different _memory location_.

You can check if two variables point to the same object using the `is` operator, which compares the identity of the two objects (in CPython, their memory addresses).
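
The steps in this section can be sketched together as follows. The variable names `df1`, `df2` and `df3` come from the text above; the exact address printed will differ on every run.

```python
class DataFrame:
    pass

df1 = DataFrame()   # first object
df2 = DataFrame()   # a second, distinct object
df3 = df2           # no new object: df3 is another name for df2's object

print(df1)          # e.g. <__main__.DataFrame object at 0x...>
print(df2 is df3)   # True
print(df1 is df2)   # False
```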

## Class constructor, properties and methods

Our data frame objects aren't doing much. They don't store any data or offer any functionality. Let's give them the ability to store some data.

We'll store a fixed dictionary in each object that's created, by defining a _constructor method_, which is executed automatically when an object is created.
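
A minimal sketch of such a constructor, assuming the fixed dictionary is `{'a': 1}` (the value mentioned later in this section):

```python
class DataFrame:
    def __init__(self):
        # Runs automatically each time DataFrame() is called
        self.data = {'a': 1}
```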


Note the following in the definition above:

- The double underscores in `__init__`
- The `self` argument passed to `__init__`, which will be set to the object that is created.
- Setting a property on `self` called `data`. We can name a property anything we wish (val, number, the_thing_inside etc. )

Let's create an object of this class.

We can now access the property `data` of `df4`.

Internally, what's happening is that Python first creates an empty object, stores the reference to the empty object in a temporary variable called `self`, calls the `__init__` function with `self` as the argument, which then sets the property `data` on the created object with the value `{'a': 1}`. Finally, the object is assigned to the variable `df4`.

We can not only access, but also change the value of the property `data`.

Note that every new object will contain its own copy of the `data` property.
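
As a small illustration of accessing and changing a property, the sketch below creates two objects and changes `data` on only one of them. The name `df5` is an illustrative assumption; `df4` is the name used above.

```python
class DataFrame:
    def __init__(self):
        self.data = {'a': 1}

df4 = DataFrame()
df5 = DataFrame()

print(df4.data)       # {'a': 1}
df4.data = {'b': 2}   # change the property on df4 only
print(df4.data)       # {'b': 2}
print(df5.data)       # {'a': 1}
```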

We can also set the initial value of the property while creating the object, by passing arguments to the constructor.

The value for the argument `data` can be passed while invoking `DataFrame` to create a new object.

Note that we can no longer invoke `DataFrame` without arguments.
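
A sketch of a constructor that accepts the data as an argument. The sample dictionary and the variable name `df` are illustrative assumptions.

```python
class DataFrame:
    def __init__(self, data):
        self.data = data

df = DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
print(df.data)   # {'a': [1, 2, 3], 'b': [4, 5, 6]}

try:
    DataFrame()  # the required `data` argument is missing
except TypeError as error:
    print("TypeError:", error)
```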

Let's define another property `columns`, which is set to the list of columns of the dataframe.

Next, let's define a method `get_column`, which retrieves the values in a given column.

Note that `df10` is automatically passed as the `self` argument to `get_column`.

In fact, the above call is the same as:
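
The sketch below puts the `columns` property, the `get_column` method and the two equivalent call forms side by side. The sample data is an illustrative assumption.

```python
class DataFrame:
    def __init__(self, data):
        self.data = data
        self.columns = list(data.keys())

    def get_column(self, column):
        return self.data[column]

df10 = DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
print(df10.columns)                      # ['a', 'b']
print(df10.get_column('a'))              # [1, 2, 3]
print(DataFrame.get_column(df10, 'a'))   # [1, 2, 3], with df10 passed explicitly as self
```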

Let's implement a method `get_row` which can be used to retrieve the row at a given position in the data frame, as a dictionary.
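
One possible sketch of `get_row`, assuming every column is a list of the same length. The sample data and the variable name `df` are illustrative.

```python
class DataFrame:
    def __init__(self, data):
        self.data = data
        self.columns = list(data.keys())

    def get_row(self, index):
        # Pick the value at `index` from every column
        return {column: values[index] for column, values in self.data.items()}

df = DataFrame({'name': ['Ann', 'Bob'], 'age': [30, 25]})
print(df.get_row(1))   # {'name': 'Bob', 'age': 25}
```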

Let's also add a `copy` method to easily create copies of data frames. We'll use the `copy` module to create a deep copy of the dictionary.

Verify that the `data` in `df13` is indeed a copy, and modifying it won't affect the `data` in `df12`.
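
A sketch of a `copy` method built on `deepcopy` from the standard `copy` module, followed by the check described above. The sample data is an illustrative assumption.

```python
from copy import deepcopy

class DataFrame:
    def __init__(self, data):
        self.data = data
        self.columns = list(data.keys())

    def copy(self):
        # deepcopy duplicates the dictionary and the lists inside it
        return DataFrame(deepcopy(self.data))

df12 = DataFrame({'a': [1, 2, 3]})
df13 = df12.copy()
df13.data['a'][0] = 100

print(df13.data)   # {'a': [100, 2, 3]}
print(df12.data)   # {'a': [1, 2, 3]}
```

A shallow copy (for example `dict(self.data)`) would not be enough here, because both dictionaries would still share the same inner lists.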

Our `DataFrame` class now contains the following functionality:

- A constructor method that can be used to pass a dictionary of data
- A `data` property that can be used to access the dictionary
- A `columns` property that can be used to get a list of columns
- A `get_column` method for getting the list of values in a column
- A `get_row` method for getting a row of data as a dictionary.
- A `copy` method for creating a copy of the data frame.

<a name='exercise_1'></a>
> **EXERCISES**: Enhance the implementation of `DataFrame` to include the following:
>
> 1. Ensure that `data` argument to the constructor is a dictionary, and that each value in the dictionary is a list of the same length. If these conditions are not satisfied, raise an exception.
> 2. Add a property `shape` which returns a tuple containing the number of rows and number of columns in the data frame
> 3. Add a method `get_element` which extracts a single value from a data frame, given a column name and row index.

Let's save our work before continuing.

## Implementing "dunder" methods for easier usage

Our implementation of `DataFrame` is shaping up well, however it still faces several limitations, which we'll discuss and address one by one in this section.

### String representation using `__str__` and `__repr__`

We can't view the contents of a `DataFrame` object in the same way we view the contents of a Pandas data frame.


We can add this by implementing the `__repr__` and `__str__` methods in the class. These are special methods in Python (also called "double underscore methods" or "dunder methods").

We'll use a helper library called `tabulate` to create a table-like output for our dataframe.

What's the difference between `__str__` and `__repr__`? Look it up!
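
The sketch below illustrates both methods. To stay dependency-free it builds the table with plain tab-separated strings instead of `tabulate`, so the output is less polished than what the library would produce. The sample data is an illustrative assumption.

```python
class DataFrame:
    def __init__(self, data):
        self.data = data
        self.columns = list(data.keys())

    def __str__(self):
        # Readable form: a header line, then one tab-separated line per row
        n_rows = len(self.data[self.columns[0]]) if self.columns else 0
        lines = ['\t'.join(self.columns)]
        for i in range(n_rows):
            lines.append('\t'.join(str(self.data[c][i]) for c in self.columns))
        return '\n'.join(lines)

    def __repr__(self):
        # Unambiguous form, useful while debugging
        return f"DataFrame({self.data!r})"

df = DataFrame({'a': [1, 2], 'b': [3, 4]})
print(df)         # uses __str__
print(repr(df))   # DataFrame({'a': [1, 2], 'b': [3, 4]})
```

In short, `print` and `str` call `__str__`, while the interactive prompt and `repr` call `__repr__`. If a class defines only `__repr__`, Python falls back to it for `str` as well.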

Great, we now have a readable string representation of our data.

### Length using `__len__`

We can find the number of rows in a Pandas dataframe using the `len` function.

However, our implementation of `DataFrame` does not support this.

To support usage with the `len` function, we can define the `__len__` method.
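
A minimal sketch, assuming the length of a data frame is its number of rows and that all columns have the same length:

```python
class DataFrame:
    def __init__(self, data):
        self.data = data
        self.columns = list(data.keys())

    def __len__(self):
        if not self.columns:
            return 0  # no columns means no rows
        return len(self.data[self.columns[0]])

df = DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
print(len(df))   # 3
```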

Note that not every class you define needs to support the `len` function.

### `__getitem__` and  `__setitem__`

While we do have a method `get_column` to retrieve values in a column from our custom data frames, Pandas dataframes allow doing this easily using the indexing notation.

Further, pandas dataframes also allow creating new columns using the indexing notation.

To support the indexing notation for getting and creating columns, we can implement the `__getitem__` and `__setitem__` methods on our class.
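
A sketch of both methods, assuming columns are stored in the `data` dictionary and listed in `columns` as in the earlier sections:

```python
class DataFrame:
    def __init__(self, data):
        self.data = data
        self.columns = list(data.keys())

    def __getitem__(self, column):
        # Called for df[column]
        return self.data[column]

    def __setitem__(self, column, values):
        # Called for df[column] = values
        if column not in self.data:
            self.columns.append(column)
        self.data[column] = values

df = DataFrame({'a': [1, 2, 3]})
df['b'] = [4, 5, 6]    # create a new column
df['a'] = [7, 8, 9]    # overwrite an existing column
print(df['a'])         # [7, 8, 9]
print(df.columns)      # ['a', 'b']
```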

We now have a way to access, add and modify columns in our dataframe.

### `__iter__`

Pandas dataframes also support iteration, and can be used in `for` loops. In each iteration of the loop, we get the name of one column of the dataframe.

To support iteration for custom classes, we can implement the `__iter__` method.

Note the use of the `yield` keyword instead of `return`. This turns the function into a "generator" function: calling it returns a generator object, which produces a new value each time the next item is requested.
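
A sketch of `__iter__` written as a generator. It assumes that iteration yields column names, which is how Pandas data frames behave.

```python
class DataFrame:
    def __init__(self, data):
        self.data = data
        self.columns = list(data.keys())

    def __iter__(self):
        for column in self.columns:
            yield column  # hand back one column name per iteration

df = DataFrame({'a': [1, 2], 'b': [3, 4]})
for column in df:
    print(column, df.data[column])
```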

We can now iterate over our dataframe using a `for` loop.

You can find a full list of "dunder" methods and their usage here: https://holycoders.com/python-dunder-special-methods/ . Keep in mind that only some dunder methods are relevant for any given class, and you needn't implement all (or any) of them for every class you create.

Let's save our work before continuing.

## Getters, setters, static methods and class methods

One of the issues with our implementation is that we can't reliably rename the columns of a dataframe, like we can in Pandas.

This error occurs because the keys in the internal dict are not modified when the list of columns is changed.

We can solve this issue by defining two functions for the `columns` property: a "getter" and a "setter".
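
One way to sketch this uses the built-in `property` decorator. Here the getter derives the column list from the dictionary keys and the setter rebuilds the dictionary under the new names; the length check is an added assumption.

```python
class DataFrame:
    def __init__(self, data):
        self.data = data

    @property
    def columns(self):
        # Getter: always reflects the current dictionary keys
        return list(self.data.keys())

    @columns.setter
    def columns(self, new_columns):
        # Setter: rename the keys while keeping the column order
        if len(new_columns) != len(self.data):
            raise ValueError("expected one name per existing column")
        self.data = dict(zip(new_columns, self.data.values()))

df = DataFrame({'a': [1, 2], 'b': [3, 4]})
df.columns = ['x', 'y']
print(df.columns)   # ['x', 'y']
print(df.data)      # {'x': [1, 2], 'y': [3, 4]}
```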

### Static Methods

We can also define methods in a class which are not bound to any specific object and can be used directly from the class.
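
As an illustration, the sketch below adds a hypothetical static method `is_valid` (the name and behaviour are assumptions, not part of the original tutorial). It receives neither `self` nor the class, so it can be called without creating an object.

```python
class DataFrame:
    def __init__(self, data):
        self.data = data

    @staticmethod
    def is_valid(data):
        # Hypothetical helper: is `data` a dict of equal-length lists?
        if not isinstance(data, dict):
            return False
        if not all(isinstance(values, list) for values in data.values()):
            return False
        lengths = {len(values) for values in data.values()}
        return len(lengths) <= 1

print(DataFrame.is_valid({'a': [1, 2], 'b': [3, 4]}))   # True
print(DataFrame.is_valid({'a': [1, 2], 'b': [3]}))      # False
```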

### Class Method

Another special type of method is a class method, which receives the class itself as the first argument, and is often used to provide alternate ways of creating an object.

As an example, let's define a class method `read_json`, which can read data from a JSON file. Along with this, let's also add a normal method `to_json`.
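
A sketch of both methods using the standard `json` module. It assumes the JSON file contains a single object mapping column names to lists; the temporary file path is only there to make the example self-contained.

```python
import json
import os
import tempfile

class DataFrame:
    def __init__(self, data):
        self.data = data

    @classmethod
    def read_json(cls, path):
        # `cls` is the class itself, so this also works for subclasses
        with open(path) as f:
            return cls(json.load(f))

    def to_json(self, path):
        with open(path, 'w') as f:
            json.dump(self.data, f)

path = os.path.join(tempfile.mkdtemp(), 'data.json')
DataFrame({'a': [1, 2], 'b': [3, 4]}).to_json(path)
df = DataFrame.read_json(path)   # called on the class, not on an object
print(df.data)                   # {'a': [1, 2], 'b': [3, 4]}
```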

<a name="exercise_2"></a>
**Exercises**:

1. Implement a class method `read_csv` and a normal method (also called instance method) `to_csv` to read and write from CSV files. You may find the `csv` module useful.


2. Recall that Pandas dataframes also allow accessing columns using the `.` notation e.g. `pandas_df.Artist`. Add support for this behaviour in our implementation of the dataframe. Hint: Use the `__getattr__` dunder method.


3. Our current implementation does not support custom indexes. Implement two more classes `Index` and `Series`. An `Index` is simply a list of indices used within a dataframe. A `Series` encapsulates the values within a column and associates them with an `Index`. Study and replicate the functionality of the Pandas `Series` and `Index` classes.


4. Implement other commonly used methods and properties of pandas dataframes. Compare the performance of your implementations with those of Pandas dataframes. What causes the performance difference?



Let's save our work before continuing.

## Inheritance, overriding and abstract methods

Classes in Python can extend other classes i.e. they can inherit properties and methods from other classes. Here's an example of inheritance using geometric shapes.

![](https://i.imgur.com/BSCxOkG.png)
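
The sketch below shows one way such a hierarchy could look. The class names `Shape`, `Polygon`, `Circle`, `Rectangle`, `Triangle` and `Square` come from this section, but every method name and signature is an assumption made for illustration.

```python
import math
from abc import ABC, abstractmethod

class Shape(ABC):
    @abstractmethod
    def area(self):
        """Subclasses must provide an implementation."""

    @abstractmethod
    def perimeter(self):
        """Subclasses must provide an implementation."""

    def is_larger_than(self, other):
        # Shared by every shape, built on the abstract area()
        return self.area() > other.area()

class Circle(Shape):
    def __init__(self, radius):
        self.radius = radius

    def area(self):
        return math.pi * self.radius ** 2

    def perimeter(self):
        return 2 * math.pi * self.radius

class Polygon(Shape):
    def __init__(self, sides):
        self.sides = sides

    def perimeter(self):
        return sum(self.sides)

class Rectangle(Polygon):
    def __init__(self, length, width):
        super().__init__([length, width, length, width])
        self.length, self.width = length, width

    def area(self):
        return self.length * self.width

class Triangle(Polygon):
    def __init__(self, a, b, c):
        super().__init__([a, b, c])

    def area(self):
        # Heron's formula
        a, b, c = self.sides
        s = self.perimeter() / 2
        return math.sqrt(s * (s - a) * (s - b) * (s - c))

class Square(Rectangle):
    def __init__(self, side):
        super().__init__(side, side)

print(Square(2).area(), Square(2).perimeter())              # 4 8
print(Rectangle(3, 4).is_larger_than(Triangle(3, 4, 5)))    # True
```

`Shape` and `Polygon` cannot be instantiated directly because they still have abstract methods. `Square` overrides only the constructor and inherits `area` from `Rectangle`, `perimeter` from `Polygon` and `is_larger_than` from `Shape`.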

Let's create a circle and try using some of the methods.

Let's compare rectangles and triangles using methods from `Shape` and `Polygon`.

Let's create a square. We can use methods from `Rectangle`, `Shape` and `Polygon` in a square.

Let's save our work before continuing.

## Summary and Further Reading

The following topics were covered in this tutorial:

- Defining classes and creating objects
- Class constructor, properties and methods
- Implementing "dunder" methods for easier usage
- Getters, setters, static methods & class methods
- Inheritance, overriding and abstract methods

Check out the following resources to learn more:

- https://www.w3schools.com/python/python_classes.asp
- https://dabeaz-course.github.io/practical-python/Notes/04_Classes_objects/01_Class.html
- https://realpython.com/python3-object-oriented-programming/
- https://dbader.org/blog/python-dunder-methods
- https://realpython.com/python-super/

## Questions for Revision
1.	What is a class?
2.	What is an object?
3.	What does the body of class contain?
4.	How do you define a class?
5.	How do you define an empty class?
6.	What is pass?
7.	How do you invoke a class?
8.	How do you access the object created by a class?
9.	Let’s say `class_name` is a class and `a=class_name()`, `b=class_name()`, will a and b have the same address?
10.	Similar to question 9, let’s say `a=b`, will a and b have the same address?
11.	How do you check if two variables point to the same object?
12.	What is a constructor method?
13.	What is `__init__(self)` ?
14.	How can you access the property of a class?
15.	Can you make changes to the value of the class's property?
16.	How can you set initial value of the property of a class?
17.	What is the purpose of copy module?
18.	What are dunder methods?
19.	What are `__repr__` and `__str__` methods?
20.	What is `__getattr__` method?
21.	What is tabulate library?
22.	How can you use `len()` in a class? What are the limitations?
23.	How can you implement indexing notation in a class?
24.	How is pandas dataframe different from class?
25.	What is `__iter__` method in class?
26.	What is yield?
27.	How can you resolve the issue of renaming a column name in class?
28.	What are static methods? Explain with an example.
29.	What is classmethod? Explain with an example.
30.	Explain the inheritance property of a class with an example.

## Solutions for Exercises

------------------------------------------------------------------------------

> **EXERCISES**: Enhance the implementation of `DataFrame` to include the following:
>
> 1. Ensure that `data` argument to the constructor is a dictionary, and that each value in the dictionary is a list of the same length. If these conditions are not satisfied, raise an exception.
> 2. Add a property `shape` which returns a tuple containing the number of rows and number of columns in the data frame
> 3. Add a method `get_element` which extracts a single value from a data frame, given a column name and row index.

Reference [(click here)](#exercise_1)

> 1. Ensure that `data` argument to the constructor is a dictionary, and that each value in the dictionary is a list of the same length. If these conditions are not satisfied, raise an exception.

> 2. Add a property `shape` which returns a tuple containing the number of rows and number of columns in the data frame

> 3. Add a method `get_element` which extracts a single value from a data frame, given a column name and row index.
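
One possible solution sketch covering all three parts. The choice of exception types and the error messages are assumptions.

```python
class DataFrame:
    def __init__(self, data):
        # Part 1: validate the input before storing it
        if not isinstance(data, dict):
            raise TypeError("data must be a dictionary")
        for column, values in data.items():
            if not isinstance(values, list):
                raise TypeError(f"column {column!r} must be a list")
        if len({len(values) for values in data.values()}) > 1:
            raise ValueError("all columns must have the same length")
        self.data = data
        self.columns = list(data.keys())

    @property
    def shape(self):
        # Part 2: (number of rows, number of columns)
        n_rows = len(self.data[self.columns[0]]) if self.columns else 0
        return (n_rows, len(self.columns))

    def get_element(self, column, index):
        # Part 3: a single value, given a column name and a row index
        return self.data[column][index]

df = DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
print(df.shape)                 # (3, 2)
print(df.get_element('b', 1))   # 5
```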

---------------------------------------------------------------------

> **Exercises**:
>
> 1. Implement a class method `read_csv` and a normal method (also called instance method) `to_csv` to read and write from CSV files. You may find the `csv` module useful.
> 2. Recall that Pandas dataframes also allow accessing columns using the `.` notation e.g. `pandas_df.Artist`. Add support for this behaviour in our implementation of the dataframe. Hint: Use the `__getattr__` dunder method.
> 3. Our current implementation does not support series. Implement a `Series` class. A `Series` encapsulates the values within a column and associates them with an `Index`. Study and replicate the functionality of the Pandas `Series` class.
> 4. Implement other commonly used methods and properties of pandas dataframes. Compare the performance of your implementations with those of Pandas dataframes. What causes the performance difference?

Reference ([click here](#exercise_2))

> 1. Implement a class method `read_csv` and a normal method (also called instance method) `to_csv` to read and write from CSV files. You may find the `csv` module useful.
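
A possible sketch using the standard `csv` module. It assumes the first line of the file is a header row, and it keeps every value as a string because CSV files carry no type information.

```python
import csv
import os
import tempfile

class DataFrame:
    def __init__(self, data):
        self.data = data
        self.columns = list(data.keys())

    @classmethod
    def read_csv(cls, path):
        with open(path, newline='') as f:
            reader = csv.reader(f)
            header = next(reader)              # first line holds the column names
            data = {column: [] for column in header}
            for row in reader:
                for column, value in zip(header, row):
                    data[column].append(value)
        return cls(data)

    def to_csv(self, path):
        with open(path, 'w', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(self.columns)
            # zip(*columns) turns the column lists into rows
            writer.writerows(zip(*self.data.values()))

path = os.path.join(tempfile.mkdtemp(), 'data.csv')
DataFrame({'name': ['Ann', 'Bob'], 'age': [30, 25]}).to_csv(path)
df = DataFrame.read_csv(path)
print(df.data)   # {'name': ['Ann', 'Bob'], 'age': ['30', '25']}
```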

> 2. Recall that Pandas dataframes also allow accessing columns using the `.` notation e.g. `pandas_df.Artist`. Add support for this behaviour in our implementation of the dataframe. Hint: Use the `__getattr__` dunder method.
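
A sketch of one way to do this. Python calls `__getattr__` only when normal attribute lookup fails, so regular attributes such as `data` keep working. The sample column names are illustrative.

```python
class DataFrame:
    def __init__(self, data):
        self.data = data

    def __getattr__(self, name):
        # Read `data` through __dict__ to avoid recursing into __getattr__
        data = self.__dict__.get('data', {})
        if name in data:
            return data[name]
        raise AttributeError(
            f"{type(self).__name__!r} object has no attribute {name!r}")

df = DataFrame({'Artist': ['Billie Holiday', 'Jimi Hendrix'], 'Genre': ['Jazz', 'Rock']})
print(df.Artist)   # ['Billie Holiday', 'Jimi Hendrix']
```

This only works for column names that are valid Python identifiers and that do not clash with existing attributes or methods.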

> 3. Our current implementation does not support series. Implement a `Series` class. A `Series` encapsulates the values within a column and associates them with an `Index`. Study and replicate the functionality of the Pandas `Series` class.

As each item in the Series is part of a dictionary, we can also get an individual element by accessing it via the index.
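
A minimal sketch of such a dictionary-backed `Series`. The constructor signature, the default integer index and the `values` property are assumptions, and the index is kept as a plain list rather than a separate `Index` class.

```python
class Series:
    def __init__(self, values, index=None):
        if index is None:
            index = list(range(len(values)))   # default labels: 0, 1, 2, ...
        if len(index) != len(values):
            raise ValueError("index and values must have the same length")
        self.index = list(index)
        self.data = dict(zip(self.index, values))

    def __getitem__(self, label):
        # Look up a single element by its index label
        return self.data[label]

    def __len__(self):
        return len(self.data)

    @property
    def values(self):
        return list(self.data.values())

ages = Series([30, 25, 41], index=['ann', 'bob', 'cy'])
print(ages['bob'])   # 25
print(ages.values)   # [30, 25, 41]
```

Because the labels are dictionary keys, duplicate labels would overwrite each other in this sketch, which is one way it differs from the Pandas `Series`.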


> 4. Implement other commonly used methods and properties of pandas dataframes. Compare the performance of your implementations with those of Pandas dataframes. What causes the performance difference?

Let's first download a large csv to compare the difference between pandas DataFrame and our DataFrame. Let's randomly pick a dataset from Kaggle which has a reasonable amount of missing values.

Dataset Used: https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews

We will use the following operations to compare the speed of pandas DataFrame and our DataFrame.
1. Creating a DataFrame using `.read_csv`.
2. Showing the first 100 lines using `.head()` method.
3. Checking the shape of the DataFrame and amount of null values before dropping them.
4. Dropping Null values
5. Re-checking the shape of the DataFrame and if there are any null values left.

We will use the `%%time` magic command in every cell to keep a note of the time.

Notation of DataFrames

__Our DataFrame__ : dfe4

__Pandas DataFrame__: dfe5

1. Creating a DataFrame using .read_csv.

__Observation__: We can see that reading a CSV is faster in pandas

2. Showing the first 10 lines using `.head()` method.

__Observation__: Pandas also performs better when printing a dataframe.

3. Checking the shape of the DataFrame and amount of null values before dropping them.

__Observation__: Again here pandas wins the race with a huge margin.

4. Dropping Null values

__Observation__ : Here we can see that pandas is performing 10 times faster than our DataFrame. Thus, pandas is a lot more efficient than our DataFrame.

5. Confirming the shape of the DataFrame and the `.dropna()` method is functioning appropriately.

__Observation__: We can see that our `.dropna()` function is working perfectly.

__Final Observation__ : We can see from the above operations that pandas performs far better than the custom DataFrame. The time difference will become even more apparent as we increase the size of the dataset. But why?

The pandas library uses NumPy under the hood. The performance-critical parts of NumPy are written in C, and pandas implements many of its own in Cython and C, which run a lot faster than pure Python. Moreover, each column in a pandas dataframe has a specific data type and is well indexed. The data structures used in pandas are efficient. For these reasons, pandas performs a lot better than our custom-made DataFrame.