# Pandas

## Data Science libraries

| Popular Python libraries used in data science |
| :---------------------- |
| **_Scientific Computing and Statistics_** |
| **NumPy** (Numerical Python)—Python does not have a built-in array data structure. It uses lists, which are convenient but relatively slow. NumPy provides the high-performance `ndarray` data structure to represent lists and matrices, and it also provides routines for processing such data structures.  |
| **SciPy** (Scientific Python)—Built on NumPy, SciPy adds routines for scientific processing, such as integrals, differential equations, additional matrix processing and more. `scipy.org` controls SciPy and NumPy.  |
| **StatsModels**—Provides support for estimations of statistical models, statistical tests and statistical data exploration. |
| **_Data Manipulation and Analysis_** |
| **Pandas**—An extremely popular library for data manipulations. Pandas makes abundant use of NumPy’s `ndarray`. Its two key data structures are `Series` (one dimensional) and `DataFrames` (two dimensional).   |
| |
| **_Visualization_** |
| **Matplotlib**—A highly customizable visualization and plotting library. Supported plots include regular, scatter, bar, contour, pie, quiver, grid, polar axis, 3D and text. |
| **Seaborn**—A higher-level visualization library built on Matplotlib. Seaborn adds a nicer look-and-feel, additional visualizations and enables you to create visualizations with less code.    |
| |
| **_Machine Learning, Deep Learning and Reinforcement Learning_** |
| **scikit-learn**—Top machine-learning library. Machine learning is a subset of AI. Deep learning is a subset of machine learning that focuses on neural networks.  |
| **Keras**—One of the easiest to use deep-learning libraries. Keras runs on top of TensorFlow (Google). |
| **TensorFlow**—From Google, this is the most widely used deep learning library. TensorFlow works with GPUs (graphics processing units) or Google’s custom TPUs (Tensor processing units) for performance. TensorFlow is important in AI and big data analytics—where processing demands are huge. You’ll use the version of Keras that’s built into TensorFlow. |
| **OpenAI Gym**—A library and environment for developing, testing and comparing reinforcement-learning algorithms.   |
|  |

## IPython and Jupyter Notebook intro

* JupyterLab works much like the IPython REPL (Read-Eval-Print Loop), as shown below.
    <div>
        <img src="attachment:Screenshot%202023-06-22%20at%2010.08.22%20PM-3.png" width="500"/>
    </div>
* Refer to the JupyterLab website for product information. If you want to try or test things, you can do it here: [Jupyter Lab](https://jupyter.org/try-jupyter/lab/?path=notebooks%2FIntro.ipynb)
* Python code can be run in two modes:
    * **Interactive mode** for executing **snippets** and immediately seeing their results.
    * **Script mode** for executing **scripts** or **programs** in **`.py`** files.

## Jupyter Notebook installation

* Open a terminal (or cmd) on your machine and run one of the following commands:


> pip3 install jupyter notebook 

or

> pip3 install jupyterlab

* After the installation, start the notebook by running the following command in the terminal or cmd:
> jupyter notebook

* Once it is open, you can create a new `.ipynb` file or open an existing one from any location.

### To run a cell in notebook

To run a cell, press **`ctrl + enter`** to execute the current cell, or **`shift + enter`** to run the cell and select the next one.

Refer to the following cheat sheet if you want to learn more commands or shortcuts.
[Shortcuts](https://www.edureka.co/blog/wp-content/uploads/2018/10/Jupyter_Notebook_CheatSheet_Edureka.pdf)

In [None]:
import pandas as pd
import numpy as np

### NumPy

* NumPy is short for `Numerical Python` and is used for working with arrays.
* It adds data structures to Python that support efficient calculations with arrays and matrices.
* At the core of the NumPy package is the `ndarray` object. It encapsulates n-dimensional arrays of homogeneous data types, with many operations performed in compiled code for performance.
* Install it from a terminal or cmd with `pip3 install numpy`.
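The points above about homogeneous data and compiled operations can be illustrated with a minimal sketch. The variable names and values here are made up for illustration and are not part of the original notebook.

```python
import numpy as np

# A 1-D ndarray; every element shares one dtype (homogeneous data).
prices = np.array([10.0, 20.0, 30.0])

# Arithmetic is applied element-wise in compiled code, without a Python loop.
discounted = prices * 0.5

# Mixing ints and floats upcasts all elements to a single floating-point dtype.
mixed = np.array([1, 2.5, 3])

print(discounted)
print(mixed.dtype)
```

A plain Python list would need an explicit loop or comprehension for the same multiplication, which is the main reason pandas builds on `ndarray`.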


In [None]:
arr = np.array([1,2,3,4]) # 1 dimensional array
arr # in a notebook, the value of a cell's last expression is displayed automatically, so print() is not needed

In [None]:
arr2 = np.array([[2,4,6],[1,5,7]]) # 2 dimensional array
print(arr2)
arr2.shape # shape of the array: (2, 3), meaning 2 rows and 3 columns

In [None]:
np.__version__ # installed NumPy version, stored in the dunder (double underscore) attribute __version__

In [None]:
type(arr) # type() returns the class of any object; here it is numpy.ndarray

## Pandas

* pandas is a fast, powerful, flexible and easy-to-use open source data analysis and manipulation tool, built on top of the Python programming language.
* pandas has two main data structures: `Series` and `DataFrame`.
    * A pandas `Series` is like a column in a table. It is a one-dimensional array holding data of any type.
      ```python
      data = [10, 20, 30, 40, 50]
      series = pd.Series(data)
      print(series)

      # Output:
      # 0    10
      # 1    20
      # 2    30
      # 3    40
      # 4    50
      # dtype: int64
      ```
    * When working with tabular data, such as data stored in spreadsheets or databases, pandas is the right tool for you. pandas will help you to explore, clean, and process your data. In pandas, a data table is called a `DataFrame`.
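To connect the two structures, here is a small sketch showing that each column of a `DataFrame` is itself a `Series`. The table contents are hypothetical sample data.

```python
import pandas as pd

# Hypothetical sample data, made up for illustration.
table = pd.DataFrame({"name": ["Ann", "Bob", "Cy"], "age": [31, 25, 40]})

# Selecting a single column returns a Series that shares the DataFrame's index.
ages = table["age"]

print(type(table).__name__)
print(type(ages).__name__)
print(table.shape)  # (rows, columns)
```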

# 1. Create

### 1.1 Create from a CSV

In [None]:
df = pd.read_csv('telco_churn.csv')
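If `telco_churn.csv` is not at hand, the same call can be tried on in-memory text. This is only a sketch: the two columns below are a made-up stand-in for the real file, and `io.StringIO` is used so that no file is needed.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk; the contents are made up for illustration.
csv_text = "State,Churn\nOH,False\nNJ,True\n"

# read_csv accepts a path or any file-like object and infers each column's dtype.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample)
print(sample.dtypes)
```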

### 1.2 Create from a Dictionary

In [None]:
tempdict = {'col1':[1,2,3], 'col2':[4,5,6], 'col3':[7,8,9]}

In [None]:
dictdf = pd.DataFrame.from_dict(tempdict)

In [None]:
dictdf

In [None]:
type(dictdf)

### 1.3 Create with the DataFrame Constructor

In [None]:
data = {
  "car-model": ['kia', 'honda', 'BMW'],
  "kms-driven": [25000, 10000, 100000],
  "manufactured date": ['2020-05-06', '2022-03-23', '2018-02-05']
}

# Load the data into a DataFrame object
df2 = pd.DataFrame(data)

print(df2) 

In [None]:
df2.info()

In [None]:
df2['manufactured date'] = pd.to_datetime(df2['manufactured date']) # convert the date column from object type to datetime

In [None]:
df2.info()
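One reason to convert date strings is that datetime columns expose the `.dt` accessor. A minimal sketch, reusing the date strings from the cell above in a standalone `Series`:

```python
import pandas as pd

# The same date strings as in the example above, held as plain text.
dates = pd.Series(["2020-05-06", "2022-03-23", "2018-02-05"])

# Parse the strings into datetime values.
parsed = pd.to_datetime(dates)

# Datetime columns expose date parts through the .dt accessor.
years = parsed.dt.year

print(parsed.dtype)
print(years.tolist())
```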

# 2. Read

### 2.1 Show Top 5 and Bottom 5 Rows

In [None]:
df.head(15) # first 15 rows; head() with no argument returns the first 5

In [None]:
dictdf.head()

In [None]:
df.tail(15) # last 15 rows; tail() with no argument returns the last 5

### 2.2 Show Columns and Data Type

In [None]:
df.columns

In [None]:
df.dtypes

### 2.3 Summary Statistics

In [None]:
df.describe()

In [None]:
df.describe(include='float64')
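To see what `include` changes, the following sketch runs `describe` on a small made-up frame. It passes the dtype as a list, which is one of the accepted forms; the column names are hypothetical.

```python
import pandas as pd

# Made-up data with a float column, an integer column and a text column.
stats_demo = pd.DataFrame({
    "minutes": [10.0, 20.0, 30.0],
    "calls": [1, 2, 3],
    "state": ["OH", "NJ", "KS"],
})

# By default, describe() summarizes only the numeric columns.
numeric_summary = stats_demo.describe()

# include restricts the summary to the listed dtypes.
float_summary = stats_demo.describe(include=["float64"])

print(numeric_summary)
print(float_summary)
```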

### 2.4 Filtering Columns

In [None]:
df.State

In [None]:
df['International plan']

In [None]:
df[['State', 'International plan']]

In [None]:
df.Churn.unique()

### 2.5 Filtering on Rows

In [None]:
df.head()

In [None]:
df1 = df[df['International plan']=='No']

In [None]:
df1

In [None]:
df[(df['International plan']=='No') & (df['Churn']==True)]
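Row filtering works through boolean masks: each comparison produces a `Series` of `True`/`False` values, and `&` combines them element-wise. The sketch below uses a made-up four-row frame with the same column names to make the intermediate masks visible.

```python
import pandas as pd

# Made-up rows using the same column names as the churn data.
calls = pd.DataFrame({
    "International plan": ["No", "Yes", "No", "No"],
    "Churn": [True, False, False, True],
})

# Each comparison yields a boolean Series aligned with the rows.
no_plan = calls["International plan"] == "No"
churned = calls["Churn"]

# & combines masks element-wise; indexing keeps the rows where the mask is True.
result = calls[no_plan & churned]

print(no_plan.tolist())
print(result)
```

When the comparisons are written inline, each one needs its own parentheses because `&` binds more tightly than `==`.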

### 2.6 Indexing with iloc

In [None]:
df.iloc[14] # row at position 14, i.e. the 15th row, since positions start at 0

In [None]:
df.iloc[22:33] # slicing: rows at positions 22 through 32 (the end position 33 is excluded)

### 2.7 Indexing with loc

In [None]:
state = df.copy()
state.set_index('State', inplace=True) # changing the index to state

In [None]:
state.head()

In [None]:
state.loc['OH']
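The difference between the two indexers is that `iloc` works with integer positions while `loc` works with index labels. A small sketch on made-up data (the `Minutes` column is hypothetical):

```python
import pandas as pd

# Made-up data; "State" contains a repeated value on purpose.
df_demo = pd.DataFrame({
    "State": ["OH", "NJ", "OH", "KS"],
    "Minutes": [10, 20, 30, 40],
})

# iloc slices by position and excludes the end position.
by_position = df_demo.iloc[1:3]

# loc slices by label and includes the end label.
by_label = df_demo.loc[1:3]

# After set_index, loc looks rows up by state and returns every match.
by_state = df_demo.set_index("State")
oh_rows = by_state.loc["OH"]

print(by_position)
print(by_label)
print(oh_rows)
```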

# 3. Update

### 3.1 Dropping Rows

In [None]:
df.isnull().sum() # count of null values in each column

In [None]:
df.isnull().any() # True for each column that has at least one null value

In [None]:
df.dropna(inplace=True) # drop every row that contains a null value, modifying df in place

In [None]:
df.isnull().sum()
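The effect of these calls is easier to see on a tiny frame with known gaps. This sketch uses made-up data and the non-in-place form of `dropna`, which returns a new DataFrame.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in each column.
raw = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})

# Count the missing values per column.
missing_per_column = raw.isnull().sum()

# Without inplace=True, dropna returns a new DataFrame and leaves raw untouched.
cleaned = raw.dropna()

print(missing_per_column)
print(cleaned)
```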

### 3.2 Dropping Columns

In [None]:
df.drop('Area code', axis=1) # returns a new DataFrame and leaves df unchanged; assign the result to a variable (or back to df) to keep the change

In [None]:
df.columns 
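A short sketch of why the column is still present after `drop`: the method returns a modified copy unless the result is assigned. The one-row frame below is made up for illustration.

```python
import pandas as pd

# Made-up one-row frame with the column to be removed.
original = pd.DataFrame({"State": ["OH"], "Area code": [415]})

# drop returns a new DataFrame; columns="..." is equivalent to axis=1.
trimmed = original.drop(columns="Area code")

print(list(original.columns))  # the original still has both columns
print(list(trimmed.columns))   # only the returned frame lacks the column
```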

### 3.3 Creating Calculated Columns

In [None]:
df['New Column'] = df['Total night minutes'] + df['Total intl minutes']

In [None]:
df.head()

### 3.4 Updating an Entire Column

In [None]:
df['New Column'] = 100

In [None]:
df.head()

### 3.5 Updating a Single Value

In [None]:
df.iloc[0,-1] = 10 # set the value in the first row (position 0) of the last column (position -1)

In [None]:
df.head()

### 3.6 Condition based Updating using Apply

In [None]:
df['Churn Binary'] = df['Churn'].apply(lambda x: 1 if x==True else 0)

In [None]:
df[df['Churn']==True].head()
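For a simple boolean-to-integer mapping like this one, `apply` with a lambda is one option; a vectorized alternative is `astype(int)`. The sketch below compares the two on a made-up `Series` and is not meant as the notebook's required approach.

```python
import pandas as pd

# Made-up boolean values standing in for the Churn column.
churn = pd.Series([True, False, True])

# apply calls the lambda once per element.
via_apply = churn.apply(lambda x: 1 if x else 0)

# astype converts the whole column at once without a Python-level function call.
via_astype = churn.astype(int)

print(via_apply.tolist())
print(via_astype.tolist())
```

`apply` remains the more flexible choice when the condition cannot be expressed as a simple type conversion.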

# 4. Delete/Output

### 4.1 Output to CSV

In [None]:
df.to_csv('output.csv') 

### 4.2 Output to JSON

In [None]:
df.to_json()

### 4.3 Output to HTML

In [None]:
df.to_html()
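When no path is given, the `to_*` methods return the serialized text as a string instead of writing a file. A minimal sketch on a made-up frame, which also uses `index=False` so that the row index is not written as an extra CSV column:

```python
import io

import pandas as pd

# Made-up frame for illustration.
small = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

# With no path argument, each method returns a string.
csv_text = small.to_csv(index=False)
json_text = small.to_json()
html_text = small.to_html()

# Reading the CSV text back reproduces the original frame.
round_trip = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(json_text)
```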

### 4.4 Delete a DataFrame

In [None]:
del df

In [None]:
df # raises NameError because the name df was deleted above
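The last cell fails on purpose: `del` removes the name, so any later reference raises `NameError`. A small sketch that catches the error instead of stopping, using a hypothetical variable `temp`:

```python
import pandas as pd

# Hypothetical throwaway frame.
temp = pd.DataFrame({"a": [1]})

# del removes the name; the object is freed once nothing else references it.
del temp

try:
    temp
except NameError as err:
    message = str(err)

print(message)
```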