# Python and Conda Environments

#### <b> Why are environments required? </b>
- Isolation - Each environment operates independently of the others
- Reproducibility - Environments can be shared easily and recreated on other systems
- Compatibility - Helps keep package versions that work together

# Jupyter Notebooks

- Let you mix code, equations, visualizations and narrative text in the same document
- Let you execute code incrementally, one cell at a time
- Support many kinds of content, such as images, interactive widgets and Markdown

# Numpy

#### <b> Why is Numpy used? </b>
- Fast & memory efficient - array operations run in compiled C code rather than in the Python interpreter
- Vectorization - an operation is applied to a whole array at once instead of looping over the elements in Python
- Integration - it is widely used and well integrated with almost all popular frameworks
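
To make the vectorization point concrete, here is a minimal sketch that computes the same sum of squares twice: once with an explicit Python loop and once with a single NumPy expression. The array size and the variable names are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

values = np.arange(1_000, dtype=np.float64)

# Explicit Python loop: one interpreted operation per element
loop_total = 0.0
for v in values:
    loop_total += v * v

# Vectorized: the multiplication and the sum both run inside NumPy
vector_total = np.sum(values * values)

print(loop_total == vector_total)
```

Both versions give the same number; the vectorized one simply avoids running the loop in Python.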

In [1]:
import numpy as np

### Creating a Numpy Array

Useful functions
- np.array
- np.zeros, np.ones, np.full, np.random
- np.linspace, np.arange
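
The last two functions in this list both produce evenly spaced values but are parameterized differently. A small sketch, with illustrative variable names:

```python
import numpy as np

# np.arange(start, stop, step): stop is excluded, spacing is set by step
stepped = np.arange(0, 10, 2)
print(stepped)

# np.linspace(start, stop, num): stop is included by default, num sets the count
spaced = np.linspace(0.0, 1.0, 5)
print(spaced)
```

Use np.arange when the step size matters and np.linspace when the number of points matters.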

#### np.array

Creates an array from an existing list, NumPy array or other array-like object

In [2]:
arr = [1, 2, 3, 4, 5, 6, 7, 8, 9]

np_arr = np.array(arr)

print(type(np_arr))
np_arr

<class 'numpy.ndarray'>


array([1, 2, 3, 4, 5, 6, 7, 8, 9])

##### Datatypes

Most useful
- float32
- float64
- int32

In [3]:
arr = [1, 2, 3, 4, 5, 6, 7, 8, 9]

np_arr = np.array(arr, dtype=np.float32)

np_arr

array([1., 2., 3., 4., 5., 6., 7., 8., 9.], dtype=float32)
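
The choice of datatype affects how much memory an array uses. The sketch below compares float64 and float32 for the same values; the array contents are arbitrary and only serve as an illustration.

```python
import numpy as np

arr64 = np.array([1, 2, 3, 4], dtype=np.float64)
arr32 = arr64.astype(np.float32)   # astype returns a converted copy

print(arr64.itemsize, arr64.nbytes)   # 8 bytes per element, 32 bytes in total
print(arr32.itemsize, arr32.nbytes)   # 4 bytes per element, 16 bytes in total
```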

#### np.zeros, np.ones, np.full, np.random

These functions create arrays with particular contents
- np.zeros, np.ones - fill the array with 0s or 1s respectively
- np.full - fills the array with a specified value
- np.random - fills the array with samples from a specified random distribution

In [4]:
shape = (5,)

np_arr = np.zeros(shape=shape)

np_arr

array([0., 0., 0., 0., 0.])

In [5]:
shape = (5,)

np_arr = np.full(shape=shape, fill_value=2)

np_arr

array([2, 2, 2, 2, 2])

In [6]:
print(np.random.rand(5,))
print(np.random.normal(size=(5,)))

[0.76864232 0.81756236 0.24090199 0.58557146 0.74015569]
[-0.17836677  0.34707273 -0.40404425 -0.25167733 -0.68112676]
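
The cells above do not show np.ones, and the random values change on every run. The sketch below fills both gaps, assuming the newer np.random.default_rng interface is acceptable here (the notebook itself uses the older np.random.rand and np.random.normal functions). The seed value 0 is arbitrary.

```python
import numpy as np

ones = np.ones(shape=(2, 3))   # 2x3 array of 1.0, float64 by default
print(ones)

# Two generators created with the same seed produce the same samples
rng_a = np.random.default_rng(seed=0)
rng_b = np.random.default_rng(seed=0)
sample_a = rng_a.normal(size=(5,))
sample_b = rng_b.normal(size=(5,))

print(np.array_equal(sample_a, sample_b))   # True
```

Seeding is useful whenever results need to be repeatable, for example when sharing a notebook.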


##### Dimensions and Shaping

As numpy is used to make vectors, matrices and tensors, dimensions become very important

NOTE: (n,) and (n, 1) are not the same shape! (n,) is a 1D array with n elements, while (n, 1) is a 2D array with n rows and 1 column.
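
A short sketch of why the distinction matters: the two shapes behave differently as soon as they take part in arithmetic. The names flat and column are illustrative.

```python
import numpy as np

flat = np.array([1, 2, 3])     # shape (3,): a 1D array
column = flat.reshape(3, 1)    # shape (3, 1): a 2D array with one column

print(flat.shape, column.shape)   # (3,) (3, 1)

# Adding the two broadcasts them against each other into a 3x3 array
same = flat + flat
mixed = flat + column
print(same.shape)    # (3,)
print(mixed.shape)   # (3, 3)
```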

In [7]:
arr_1 = [1, 2, 3, 4, 5, 6, 7, 8]
arr_2 = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
]
arr_3 = [
    [
        [1, 2],
        [3, 4]
    ],
    [
        [5, 6],
        [7, 8]
    ]
]

np_arr_1 = np.array(arr_1)
np_arr_2 = np.array(arr_2)
np_arr_3 = np.array(arr_3)

print(np_arr_1.shape)
print(np_arr_2.shape)
print(np_arr_3.shape)

(8,)
(2, 4)
(2, 2, 2)


np.zeros, np.ones, np.full and np.empty have the analogues np.zeros_like, np.ones_like, np.full_like and np.empty_like, which create an array with the same shape as a given array. The new array also takes the datatype of the given array unless a dtype is specified.

In [8]:
np.zeros_like(np_arr_3).shape

(2, 2, 2)
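
The datatype behaviour can be checked directly. In this sketch, template is a hypothetical array chosen only to have a non-default dtype.

```python
import numpy as np

template = np.array([[1, 2], [3, 4]], dtype=np.int32)

same_dtype = np.ones_like(template)                      # inherits int32
other_dtype = np.ones_like(template, dtype=np.float64)   # dtype overridden
sevens = np.full_like(template, 7)                       # same shape, filled with 7

print(same_dtype.dtype, other_dtype.dtype)   # int32 float64
print(sevens)
```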

##### Reshaping

In ML, especially Deep Learning, we need to reshape arrays to fit our model.

The new shape must be compatible with the old one, i.e. both shapes must contain the same total number of elements.

In [9]:
print("Original Array:\n", np_arr_3)
print()
print("2D Array:\n", np_arr_3.reshape(2, 4))
print()
print("1D Array:\n", np_arr_3.reshape(8,))

Original Array:
 [[[1 2]
  [3 4]]

 [[5 6]
  [7 8]]]

2D Array:
 [[1 2 3 4]
 [5 6 7 8]]

1D Array:
 [1 2 3 4 5 6 7 8]
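
When the element counts do not match, NumPy raises a ValueError instead of reshaping. A minimal sketch with an illustrative 8-element array:

```python
import numpy as np

arr = np.arange(8)   # 8 elements

reshaped = arr.reshape(4, 2)   # 4 * 2 == 8, so this works
print(reshaped.shape)          # (4, 2)

try:
    arr.reshape(3, 3)          # 3 * 3 == 9, which does not match 8
except ValueError as err:
    print("reshape failed:", err)
```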


When we don't care about one of the dimensions, we can pass -1 for it and NumPy infers its size from the total number of elements. For example, if the array has 8 elements and the first dimension must be 2, the shape (2, -1) resolves to (2, 4).

In [10]:
print("Original Array:\n", np_arr_3)
print()
print("2D Array:\n", np_arr_3.reshape(2, 4))
print()
print("2D Array without specifying the second dimension:\n", np_arr_3.reshape(2, -1))
print()

Original Array:
 [[[1 2]
  [3 4]]

 [[5 6]
  [7 8]]]

2D Array:
 [[1 2 3 4]
 [5 6 7 8]]

2D Array without specifying the second dimension:
 [[1 2 3 4]
 [5 6 7 8]]
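
The -1 placeholder can stand in for any single dimension, but only one dimension per reshape call may be left unspecified. A few more cases, sketched on an illustrative 8-element array:

```python
import numpy as np

arr = np.arange(1, 9)   # shape (8,)

two_rows = arr.reshape(-1, 4)    # NumPy infers 8 / 4 = 2 rows
cube = arr.reshape(2, 2, -1)     # last dimension inferred as 2
column = arr.reshape(-1, 1)      # a column vector with 8 rows

print(two_rows.shape)   # (2, 4)
print(cube.shape)       # (2, 2, 2)
print(column.shape)     # (8, 1)
```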



##### Slicing

Just like lists in Python, NumPy arrays can be sliced. The same start:stop:step syntax also works in other libraries such as Pandas, PyTorch and TensorFlow.

In [11]:
arr = np.arange(1, 10) # gives an array like [1, 2, ..., 9]

print(arr)
print(arr[:5])
print(arr[2:5])
print(arr[::2])

[1 2 3 4 5 6 7 8 9]
[1 2 3 4 5]
[3 4 5]
[1 3 5 7 9]


In [12]:
arr = arr.reshape(3,3)

In [13]:
print(arr)
print(arr[:, :1])
print(arr[:2])
print(arr[:2, :2])

[[1 2 3]
 [4 5 6]
 [7 8 9]]
[[1]
 [4]
 [7]]
[[1 2 3]
 [4 5 6]]
[[1 2]
 [4 5]]
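
One difference from Python lists is worth knowing: a basic NumPy slice is a view into the original array, not a copy, so writing to the slice also changes the original. The sketch below demonstrates this with illustrative names.

```python
import numpy as np

arr = np.arange(1, 10).reshape(3, 3)

first_row = arr[0]     # a view into arr, not a copy
first_row[0] = 100
print(arr[0, 0])       # 100: the original array changed too

safe_row = arr[1].copy()   # an explicit copy is independent
safe_row[0] = -1
print(arr[1, 0])           # 4: the original is untouched
```

Call .copy() whenever a slice should be modified without affecting the source array.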


##### Joining Arrays

- Concatenating, Stacking

In [14]:
arr = [1, 2, 3]
np_arr = np.array(arr)

print("Original Array:", np_arr)
print("With axis 0:\n", np.concatenate([np_arr, np_arr], axis=0))

np_arr = np_arr.reshape(-1, 1)

print("Reshaped Array:\n", np_arr)
print("With axis 1:\n", np.concatenate([np_arr, np_arr], axis=1))
print("With axis 0:\n", np.concatenate([np_arr, np_arr], axis=0))

Original Array: [1 2 3]
With axis 0:
 [1 2 3 1 2 3]
Reshaped Array:
 [[1]
 [2]
 [3]]
With axis 1:
 [[1 1]
 [2 2]
 [3 3]]
With axis 0:
 [[1]
 [2]
 [3]
 [1]
 [2]
 [3]]


Arrays can also be joined by stacking: np.vstack stacks them vertically, np.hstack stacks them horizontally, and np.stack joins them along a new axis.
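
A minimal sketch of the stacking functions applied to two 1D arrays; a and b are illustrative inputs.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

vertical = np.vstack([a, b])        # arrays become rows
horizontal = np.hstack([a, b])      # arrays are joined end to end
new_axis_0 = np.stack([a, b], axis=0)   # new leading axis
new_axis_1 = np.stack([a, b], axis=1)   # new trailing axis

print(vertical.shape)     # (2, 3)
print(horizontal.shape)   # (6,)
print(new_axis_0.shape)   # (2, 3)
print(new_axis_1.shape)   # (3, 2)
```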

#### Linear Algebra and Arithmetic with numpy

NumPy is built around array operations, so linear algebra is very convenient to perform with it.

##### <b>Scalar Multiplication</b> 

In [15]:
arr = np.arange(1, 5) # looks like [1, 2, 3, 4]
scalar = 2

print(arr)
print(scalar)
print(arr * scalar)

[1 2 3 4]
2
[2 4 6 8]


##### <b>Vector Multiplication</b> 

In [16]:
# DOT PRODUCT??

arr_1 = np.array([1, 1])
arr_2 = np.array([-1, 1])

print("DOT Product??:",arr_1*arr_2)

DOT Product??: [-1  1]


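The '*' operator above multiplies elementwise and returns an array, which is not a dot product. For dot products, NumPy provides the '@' operator, as well as np.dot and np.matmul. The whole computation runs as one vectorized operation rather than a Python loop over the elements, which is what makes NumPy fast.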

In [17]:
# Dot product with @, np.dot and np.matmul

arr_1 = np.array([1, 1])
arr_2 = np.array([-1, 1])

print("DOT Product:",arr_1@arr_2)
print("DOT Product:",np.dot(arr_1, arr_2))
print("DOT Product:",np.matmul(arr_1, arr_2))

DOT Product: 0
DOT Product: 0
DOT Product: 0
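
The same distinction between '*' and '@' applies to 2D arrays, where '@' performs matrix multiplication. A small sketch with an illustrative 2x2 matrix:

```python
import numpy as np

matrix = np.array([[1, 2],
                   [3, 4]])
vector = np.array([1, 1])

elementwise = matrix * matrix   # multiplies matching entries
product = matrix @ matrix       # rows times columns
mat_vec = matrix @ vector       # matrix-vector product

print(elementwise)   # [[1, 4], [9, 16]]
print(product)       # [[7, 10], [15, 22]]
print(mat_vec)       # [3, 7]
```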


Numpy also has the functionality to perform inner and outer products.

In [18]:
arr_1 = np.array([1, 2, 3])
arr_2 = np.array([2, 2, 2])

print("Inner Product:", np.inner(arr_1, arr_2)) # treats the vectors as 1xn and nx1
print("Outer Product:\n", np.outer(arr_1, arr_2)) # treats the vectors as nx1 and 1xn

Inner Product: 12
Outer Product:
 [[2 2 2]
 [4 4 4]
 [6 6 6]]
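
The shape interpretation of the two products can be verified by reshaping the vectors explicitly and using '@'. This is only a sketch for checking the idea; u and v reuse the values from the cell above under new, illustrative names.

```python
import numpy as np

u = np.array([1, 2, 3])
v = np.array([2, 2, 2])

# Outer product as (n, 1) @ (1, n): the result is n x n
outer_via_matmul = u.reshape(-1, 1) @ v.reshape(1, -1)
print(np.array_equal(outer_via_matmul, np.outer(u, v)))   # True

# Inner product as (1, n) @ (n, 1): the result is a 1x1 array
inner_via_matmul = u.reshape(1, -1) @ v.reshape(-1, 1)
print(inner_via_matmul.shape, inner_via_matmul[0, 0])     # (1, 1) 12
```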


# Pandas


Why Pandas?

Data is not always in the form of numbers. Seeing a 2D NumPy array may not make sense. But what if these rows and columns had labels?

Most of our data looks like an Excel sheet, with each column having a particular name.

This is why we use Pandas.

In [19]:
import numpy as np
import pandas as pd

## Types of objects in Pandas

Pandas has two core objects:

- Series: for dealing with 1D data.
- DataFrame: for dealing with 2D data.

Let's look at DataFrames as we will be using them the most.
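
Before moving on, here is a minimal sketch of how the two objects relate. The values are the first three rows of the housing data used later in this notebook; the variable names are illustrative.

```python
import pandas as pd

# A Series is a single labelled column of values
prices = pd.Series([13300000, 12250000, 12250000], name="price")

# A DataFrame is a table; each column is a Series sharing the same index
frame = pd.DataFrame({
    "price": [13300000, 12250000, 12250000],
    "area": [7420, 8960, 9960],
})

column = frame["price"]   # selecting one column returns a Series
print(type(column))
print(frame.shape)        # (3, 2)
```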

## Reading Files

Pandas allows you to import data files into DataFrames. Supported formats include CSV, TSV, Excel, Parquet and pickle.

These are a few of the formats you will come across often.

In [20]:
housing = pd.read_csv("Housing.csv")  # pass sep="\t" (or another delimiter) for other delimiter-separated files

# pd.read_excel(file_name) for Excel
# pd.read_parquet(file_name) for Parquet
# pd.read_pickle(file_name) for pickle

# To write the processed DataFrame back to disk, use to_csv, to_excel, etc.
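
The sketch below reads tab-separated text and writes it back out as CSV. To stay self-contained it uses an in-memory string (io.StringIO) as a stand-in for a file on disk; the three columns and two rows are copied from the housing data, and the variable names are assumptions for illustration.

```python
import io
import pandas as pd

# Stand-in for a tab-separated file on disk
tsv_text = "price\tarea\tbedrooms\n13300000\t7420\t4\n12250000\t8960\t4\n"

frame = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(frame.shape)   # (2, 3)

# With no path given, to_csv returns the CSV text;
# index=False keeps the row index out of the output
csv_text = frame.to_csv(index=False)
print(csv_text)
```

Without index=False, the row index is written as an extra unnamed first column, which then shows up as a spurious column when the file is read again.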

### Viewing the dataframe

In [21]:
housing.head()

,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,13300000,7420,4,2,3,yes,no,no,no,yes,2,yes,furnished
1,12250000,8960,4,4,4,yes,no,no,no,yes,3,no,furnished
2,12250000,9960,3,2,2,yes,no,yes,no,no,2,yes,semi-furnished
3,12215000,7500,4,2,2,yes,no,yes,no,yes,3,yes,furnished
4,11410000,7420,4,1,2,yes,yes,yes,no,yes,2,no,furnished


In [22]:
housing.tail(10)

,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
535,2100000,3360,2,1,1,yes,no,no,no,no,1,no,unfurnished
536,1960000,3420,5,1,2,no,no,no,no,no,0,no,unfurnished
537,1890000,1700,3,1,2,yes,no,no,no,no,0,no,unfurnished
538,1890000,3649,2,1,1,yes,no,no,no,no,0,no,unfurnished
539,1855000,2990,2,1,1,no,no,no,no,no,1,no,unfurnished
540,1820000,3000,2,1,1,yes,no,yes,no,no,2,no,unfurnished
541,1767150,2400,3,1,1,no,no,no,no,no,0,no,semi-furnished
542,1750000,3620,2,1,1,yes,no,no,no,no,0,no,unfurnished
543,1750000,2910,3,1,1,no,no,no,no,no,0,no,furnished
544,1750000,3850,3,1,2,yes,no,no,no,no,0,no,unfurnished


In [23]:
print("shape:",housing.shape)
print("size:",housing.size)
print("columns:",housing.columns)

shape: (545, 13)
size: 7085
columns: Index(['price', 'area', 'bedrooms', 'bathrooms', 'stories', 'mainroad',
       'guestroom', 'basement', 'hotwaterheating', 'airconditioning',
       'parking', 'prefarea', 'furnishingstatus'],
      dtype='object')


In [24]:
housing.dtypes

price                int64
area                 int64
bedrooms             int64
bathrooms            int64
stories              int64
mainroad            object
guestroom           object
basement            object
hotwaterheating     object
airconditioning     object
parking              int64
prefarea            object
furnishingstatus    object
dtype: object

### How to see statistical data relating to every column?

The describe method of a DataFrame shows summary statistics (count, mean, standard deviation, minimum, quartiles and maximum) for all numerical columns.

In [25]:
housing.describe()

,price,area,bedrooms,bathrooms,stories,parking
count,545.0,545.0,545.0,545.0,545.0,545.0
mean,4766729.0,5150.541284,2.965138,1.286239,1.805505,0.693578
std,1870440.0,2170.141023,0.738064,0.50247,0.867492,0.861586
min,1750000.0,1650.0,1.0,1.0,1.0,0.0
25%,3430000.0,3600.0,2.0,1.0,1.0,0.0
50%,4340000.0,4600.0,3.0,1.0,2.0,0.0
75%,5740000.0,6360.0,3.0,2.0,2.0,1.0
max,13300000.0,16200.0,6.0,4.0,4.0,3.0
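
By default, text columns such as mainroad are left out of the summary. Passing include="all" adds them, with count, unique, top and freq rows. The sketch below uses a small hypothetical frame rather than the housing data so that the numbers are easy to check by hand.

```python
import pandas as pd

# Small hypothetical frame with one numeric and one text column
frame = pd.DataFrame({
    "price": [100, 200, 300, 400],
    "mainroad": ["yes", "yes", "no", "yes"],
})

numeric_summary = frame.describe()            # numeric columns only
full_summary = frame.describe(include="all")  # numeric and text columns

print(numeric_summary)
print(full_summary)
```

In the full summary, statistics that do not apply to a column (for example the mean of mainroad) are shown as NaN.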


### Indexing


In [26]:
housing[:4]

,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,13300000,7420,4,2,3,yes,no,no,no,yes,2,yes,furnished
1,12250000,8960,4,4,4,yes,no,no,no,yes,3,no,furnished
2,12250000,9960,3,2,2,yes,no,yes,no,no,2,yes,semi-furnished
3,12215000,7500,4,2,2,yes,no,yes,no,yes,3,yes,furnished


In [27]:
housing.loc[:5]

,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,13300000,7420,4,2,3,yes,no,no,no,yes,2,yes,furnished
1,12250000,8960,4,4,4,yes,no,no,no,yes,3,no,furnished
2,12250000,9960,3,2,2,yes,no,yes,no,no,2,yes,semi-furnished
3,12215000,7500,4,2,2,yes,no,yes,no,yes,3,yes,furnished
4,11410000,7420,4,1,2,yes,yes,yes,no,yes,2,no,furnished
5,10850000,7500,3,3,1,yes,no,yes,no,yes,2,yes,semi-furnished


In [28]:
housing.loc[:5,"price":"basement"]

,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement
0,13300000,7420,4,2,3,yes,no,no
1,12250000,8960,4,4,4,yes,no,no
2,12250000,9960,3,2,2,yes,no,yes
3,12215000,7500,4,2,2,yes,no,yes
4,11410000,7420,4,1,2,yes,yes,yes
5,10850000,7500,3,3,1,yes,no,yes


In [29]:
housing.iloc[:5,:8]

,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement
0,13300000,7420,4,2,3,yes,no,no
1,12250000,8960,4,4,4,yes,no,no
2,12250000,9960,3,2,2,yes,no,yes
3,12215000,7500,4,2,2,yes,no,yes
4,11410000,7420,4,1,2,yes,yes,yes
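
The outputs above show the key difference between the two indexers: loc[:5] returned six rows because loc slices by label and includes the end label, while iloc[:5] returned five rows because iloc slices by position and excludes the end position. The sketch below repeats this on a small hypothetical frame with string labels, where labels and positions cannot be confused.

```python
import pandas as pd

# Hypothetical frame with a non-default index
frame = pd.DataFrame(
    {"price": [10, 20, 30, 40], "area": [1, 2, 3, 4]},
    index=["a", "b", "c", "d"],
)

by_label = frame.loc["a":"c"]     # label-based: end label "c" is included
by_position = frame.iloc[0:2]     # position-based: position 2 is excluded
corner = frame.iloc[0:2, 0:1]     # first two rows, first column only

# Boolean masks also work with .loc
large_areas = frame.loc[frame["price"] > 20, "area"]

print(len(by_label), len(by_position))   # 3 2
print(corner.shape)                      # (2, 1)
print(large_areas.tolist())              # [3, 4]
```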


## Joins/Merge/Concatenating

In [30]:
housing_1=housing.iloc[:,:3]
housing_2=housing.iloc[:,3:]

In [31]:
housing_1

,price,area,bedrooms
0,13300000,7420,4
1,12250000,8960,4
2,12250000,9960,3
3,12215000,7500,4
4,11410000,7420,4
...,...,...,...
540,1820000,3000,2
541,1767150,2400,3
542,1750000,3620,2
543,1750000,2910,3


In [32]:
housing_2

,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,2,3,yes,no,no,no,yes,2,yes,furnished
1,4,4,yes,no,no,no,yes,3,no,furnished
2,2,2,yes,no,yes,no,no,2,yes,semi-furnished
3,2,2,yes,no,yes,no,yes,3,yes,furnished
4,1,2,yes,yes,yes,no,yes,2,no,furnished
...,...,...,...,...,...,...,...,...,...,...
540,1,1,yes,no,yes,no,no,2,no,unfurnished
541,1,1,no,no,no,no,no,0,no,semi-furnished
542,1,1,yes,no,no,no,no,0,no,unfurnished
543,1,1,no,no,no,no,no,0,no,furnished


In [33]:
housing_1.join(housing_2)

,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,13300000,7420,4,2,3,yes,no,no,no,yes,2,yes,furnished
1,12250000,8960,4,4,4,yes,no,no,no,yes,3,no,furnished
2,12250000,9960,3,2,2,yes,no,yes,no,no,2,yes,semi-furnished
3,12215000,7500,4,2,2,yes,no,yes,no,yes,3,yes,furnished
4,11410000,7420,4,1,2,yes,yes,yes,no,yes,2,no,furnished
...,...,...,...,...,...,...,...,...,...,...,...,...,...
540,1820000,3000,2,1,1,yes,no,yes,no,no,2,no,unfurnished
541,1767150,2400,3,1,1,no,no,no,no,no,0,no,semi-furnished
542,1750000,3620,2,1,1,yes,no,no,no,no,0,no,unfurnished
543,1750000,2910,3,1,1,no,no,no,no,no,0,no,furnished


In [34]:
housing_1.join(housing_2,how="cross")

,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,13300000,7420,4,2,3,yes,no,no,no,yes,2,yes,furnished
1,13300000,7420,4,4,4,yes,no,no,no,yes,3,no,furnished
2,13300000,7420,4,2,2,yes,no,yes,no,no,2,yes,semi-furnished
3,13300000,7420,4,2,2,yes,no,yes,no,yes,3,yes,furnished
4,13300000,7420,4,1,2,yes,yes,yes,no,yes,2,no,furnished
...,...,...,...,...,...,...,...,...,...,...,...,...,...
297020,1750000,3850,3,1,1,yes,no,yes,no,no,2,no,unfurnished
297021,1750000,3850,3,1,1,no,no,no,no,no,0,no,semi-furnished
297022,1750000,3850,3,1,1,yes,no,no,no,no,0,no,unfurnished
297023,1750000,3850,3,1,1,no,no,no,no,no,0,no,furnished


In [35]:
pd.concat([housing_1,housing_2],axis=1)

,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,13300000,7420,4,2,3,yes,no,no,no,yes,2,yes,furnished
1,12250000,8960,4,4,4,yes,no,no,no,yes,3,no,furnished
2,12250000,9960,3,2,2,yes,no,yes,no,no,2,yes,semi-furnished
3,12215000,7500,4,2,2,yes,no,yes,no,yes,3,yes,furnished
4,11410000,7420,4,1,2,yes,yes,yes,no,yes,2,no,furnished
...,...,...,...,...,...,...,...,...,...,...,...,...,...
540,1820000,3000,2,1,1,yes,no,yes,no,no,2,no,unfurnished
541,1767150,2400,3,1,1,no,no,no,no,no,0,no,semi-furnished
542,1750000,3620,2,1,1,yes,no,no,no,no,0,no,unfurnished
543,1750000,2910,3,1,1,no,no,no,no,no,0,no,furnished
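
The cells above combine frames by row index: join and concat with axis=1 line rows up by their index, and how="cross" pairs every row of one frame with every row of the other (545 x 545 rows here). When two tables share a key column instead, pd.merge can be used. The sketch below assumes two hypothetical tables with a shared house_id column, which does not exist in the housing data.

```python
import pandas as pd

# Two hypothetical tables that share a "house_id" column
prices = pd.DataFrame({"house_id": [1, 2, 3], "price": [100, 200, 300]})
areas = pd.DataFrame({"house_id": [2, 3, 4], "area": [50, 60, 70]})

inner = pd.merge(prices, areas, on="house_id", how="inner")  # ids in both tables
left = pd.merge(prices, areas, on="house_id", how="left")    # all ids from prices
outer = pd.merge(prices, areas, on="house_id", how="outer")  # ids from either table

print(len(inner), len(left), len(outer))   # 2 3 4
```

Rows without a match in the other table get NaN in the missing columns for the left and outer variants.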


Rows can be grouped by the values of a column (generally a categorical one) to compute statistics for each category.

In [36]:
grouped=housing.groupby(by="bedrooms")

In [37]:
grouped.mean(numeric_only=True)  # restrict the mean to numeric columns; text columns such as mainroad cannot be averaged


bedrooms,price,area,bathrooms,stories,parking
1,2712500.0,3710.0,1.0,1.0,0.0
2,3632022.0,4636.235294,1.058824,1.169118,0.492647
3,4954598.0,5226.62,1.266667,1.933333,0.723333
4,5729758.0,5582.063158,1.621053,2.305263,0.915789
5,5819800.0,6291.5,1.8,2.0,0.6
6,4791500.0,3950.0,1.5,2.0,0.5
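
A grouped object is not limited to a single statistic. The sketch below uses agg with named aggregations to compute several per-group values at once. The frame is a small hypothetical one, and the output column names mean_price, max_area and houses are made up for illustration.

```python
import pandas as pd

# Small hypothetical frame; not the housing data
frame = pd.DataFrame({
    "bedrooms": [2, 2, 3, 3, 3],
    "price": [100, 300, 200, 400, 600],
    "area": [10, 30, 20, 40, 60],
})

summary = frame.groupby("bedrooms").agg(
    mean_price=("price", "mean"),   # average price per group
    max_area=("area", "max"),       # largest area per group
    houses=("price", "count"),      # number of rows per group
)
print(summary)
```

Each keyword becomes a column in the result, and the group key (bedrooms) becomes the index.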
