### OCI Data Science - Useful Tips
<details>
<summary><font size="2">Check for Public Internet Access</font></summary>

```python
import requests
# The timeout keeps the cell from hanging if there is no route to the internet.
response = requests.get("https://oracle.com", timeout=10)
assert response.status_code == 200, "Internet connection failed"
```
</details>
<details>
<summary><font size="2">Helpful Documentation </font></summary>
<ul><li><a href="https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm">Data Science Service Documentation</a></li>
<li><a href="https://docs.cloud.oracle.com/iaas/tools/ads-sdk/latest/index.html">ADS documentation</a></li>
</ul>
</details>
<details>
<summary><font size="2">Typical Cell Imports and Settings for ADS</font></summary>

```python
%load_ext autoreload
%autoreload 2
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

import logging
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.ERROR)

import ads
from ads.dataset.factory import DatasetFactory
from ads.automl.provider import OracleAutoMLProvider
from ads.automl.driver import AutoML
from ads.evaluations.evaluator import ADSEvaluator
from ads.common.data import ADSData
from ads.explanations.explainer import ADSExplainer
from ads.explanations.mlx_global_explainer import MLXGlobalExplainer
from ads.explanations.mlx_local_explainer import MLXLocalExplainer
from ads.catalog.model import ModelCatalog
from ads.common.model_artifact import ModelArtifact
```
</details>
<details>
<summary><font size="2">Useful Environment Variables</font></summary>

```python
import os
print(os.environ["NB_SESSION_COMPARTMENT_OCID"])
print(os.environ["PROJECT_OCID"])
print(os.environ["USER_OCID"])
print(os.environ["TENANCY_OCID"])
print(os.environ["NB_REGION"])
```
</details>

# Pandas
#### Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures
#### and data analysis tools for the Python programming language.
#### Pandas is a must for Exploratory Data Analysis (EDA).
#### Agenda:
* What is a DataFrame? --> A DataFrame in Pandas is a __2-dimensional, tabular data structure (like an Excel sheet or SQL table)__ that lets you store and manipulate data in rows and columns.
* What is a Series? --> A Series in Pandas is a __one-dimensional labeled array, like a single column of data__ from a DataFrame. It is the building block of a DataFrame: one column of values plus their labels (the index).
* Different Operations in Pandas
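
As a minimal sketch of the two structures named in the agenda, the cell below builds a Series and a DataFrame by hand and checks that a single DataFrame column is itself a Series. The names `s` and `frame` and the sample values are made up for illustration.

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels.
s = pd.Series([10, 20, 30], index=["a", "b", "c"], name="score")

# A DataFrame: two-dimensional, built here from a dict of columns.
frame = pd.DataFrame(
    {"score": [10, 20, 30], "passed": [False, True, True]},
    index=["a", "b", "c"],
)

# Selecting one column of a DataFrame gives back a Series.
column = frame["score"]
print(type(column).__name__)  # Series
print(column.equals(s))       # True
```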

In [1]:
# import pandas
import pandas as pd
import numpy as np

In [4]:
# Create a 5x4 DataFrame with labeled rows and columns
df=pd.DataFrame(np.arange(0,20).reshape(5,4),index=['Row1','Row2','Row3','Row4','Row5'],columns=['Column1','Column2','Column3','Column4'])

In [5]:
df.head()

      Column1  Column2  Column3  Column4
Row1        0        1        2        3
Row2        4        5        6        7
Row3        8        9       10       11
Row4       12       13       14       15
Row5       16       17       18       19


In [6]:
df.to_csv('Test1.csv')
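
A hedged sketch of reading the saved file back: `to_csv` writes the row labels as the first CSV column, so reading without `index_col` yields an extra column that pandas names `Unnamed: 0`. The temporary directory is an assumption added here so the example leaves no files behind; the notebook itself writes `Test1.csv` to the working directory.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(0, 20).reshape(5, 4),
                  index=['Row1', 'Row2', 'Row3', 'Row4', 'Row5'],
                  columns=['Column1', 'Column2', 'Column3', 'Column4'])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'Test1.csv')
    df.to_csv(path)

    # Without index_col, the saved row labels come back as an ordinary column.
    plain = pd.read_csv(path)

    # index_col=0 restores the first CSV column as the row index.
    restored = pd.read_csv(path, index_col=0)

print(plain.columns[0])         # Unnamed: 0
print(restored.index.tolist())  # ['Row1', 'Row2', 'Row3', 'Row4', 'Row5']
```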

In [7]:
# Two ways to access elements:
# 1) .loc  selects by label (row and column names)
# 2) .iloc selects by integer position (row and column numbers, starting at 0)

df.loc['Row1']

Column1    0
Column2    1
Column3    2
Column4    3
Name: Row1, dtype: int64

In [8]:
type(df.loc['Row1'])

pandas.core.series.Series

In [10]:
df.iloc[:,:] # rows go before the comma, columns after it; ':' selects everything

      Column1  Column2  Column3  Column4
Row1        0        1        2        3
Row2        4        5        6        7
Row3        8        9       10       11
Row4       12       13       14       15
Row5       16       17       18       19


In [11]:
# Select all rows of the second column (position 1)
df.iloc[:,1]

Row1     1
Row2     5
Row3     9
Row4    13
Row5    17
Name: Column2, dtype: int64

In [12]:
df.iloc[0:3,0:2] # first 3 rows, first 2 columns (the stop position is excluded)
# In R, indexing starts at 1; in Python, it starts at 0

      Column1  Column2
Row1        0        1
Row2        4        5
Row3        8        9


In [13]:
type(df.iloc[0:3,0:2])

pandas.core.frame.DataFrame
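
One difference between the two accessors is easy to miss: `.iloc` slices exclude the stop position, while `.loc` slices include the stop label. A small sketch on the same `df` (rebuilt here so the cell runs on its own):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(0, 20).reshape(5, 4),
                  index=['Row1', 'Row2', 'Row3', 'Row4', 'Row5'],
                  columns=['Column1', 'Column2', 'Column3', 'Column4'])

# Position-based: rows 0, 1, 2 and columns 0, 1 (stop positions excluded).
by_position = df.iloc[0:3, 0:2]

# Label-based: 'Row3' and 'Column2' are included in the result.
by_label = df.loc['Row1':'Row3', 'Column1':'Column2']

print(by_position.equals(by_label))  # True
```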

In [14]:
df.iloc[0] # this gives one row

Column1    0
Column2    1
Column3    2
Column4    3
Name: Row1, dtype: int64

In [15]:
df.iloc[:,0] # this gives one column 

Row1     0
Row2     4
Row3     8
Row4    12
Row5    16
Name: Column1, dtype: int64

In [16]:
type(df.iloc[:,0])

pandas.core.series.Series

In [17]:
df.head()

      Column1  Column2  Column3  Column4
Row1        0        1        2        3
Row2        4        5        6        7
Row3        8        9       10       11
Row4       12       13       14       15
Row5       16       17       18       19


In [19]:
df.iloc[:,1:] # all rows, columns from position 1 onward

      Column2  Column3  Column4
Row1        1        2        3
Row2        5        6        7
Row3        9       10       11
Row4       13       14       15
Row5       17       18       19


In [21]:
df.iloc[:,3:]

      Column4
Row1        3
Row2        7
Row3       11
Row4       15
Row5       19


In [28]:
# Same data, with each integer position written into the row and column labels
df1=pd.DataFrame(np.arange(0,20).reshape(5,4),index=['Row1(0)','Row2(1)','Row3(2)','Row4(3)','Row5(4)'],columns=['Col1(0)','Col2(1)','Col3(2)','Col4(3)'])

In [25]:
df1.head()

         Col1(0)  Col2(1)  Col3(2)  Col4(3)
Row1(0)        0        1        2        3
Row2(1)        4        5        6        7
Row3(2)        8        9       10       11
Row4(3)       12       13       14       15
Row5(4)       16       17       18       19
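
The labels in `df1` spell out each position by hand. As a sketch, the same label-to-position mapping can be looked up with `Index.get_loc`, shown here on the original `df` (rebuilt so the cell is self-contained). The names `row_pos` and `col_pos` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(0, 20).reshape(5, 4),
                  index=['Row1', 'Row2', 'Row3', 'Row4', 'Row5'],
                  columns=['Column1', 'Column2', 'Column3', 'Column4'])

# Translate labels into the integer positions that .iloc expects.
row_pos = df.index.get_loc('Row3')
col_pos = df.columns.get_loc('Column2')
print(row_pos, col_pos)  # 2 1

# Both accessors now point at the same cell.
print(df.iloc[row_pos, col_pos] == df.loc['Row3', 'Column2'])  # True
```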


In [29]:
df.iloc[:,3:]

      Column4
Row1        3
Row2        7
Row3       11
Row4       15
Row5       19


## Convert a DataFrame into a NumPy array with ".values"

In [30]:
df.iloc[:,1:].values

array([[ 1,  2,  3],
       [ 5,  6,  7],
       [ 9, 10, 11],
       [13, 14, 15],
       [17, 18, 19]])

In [31]:
df1.iloc[:,1:].values

array([[ 1,  2,  3],
       [ 5,  6,  7],
       [ 9, 10, 11],
       [13, 14, 15],
       [17, 18, 19]])

In [33]:
df1.iloc[:,1:].values.shape

(5, 3)
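
The pandas documentation recommends `DataFrame.to_numpy()` over `.values` for this conversion. A brief sketch, using the same `df` as above, showing that both give the same array here:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(0, 20).reshape(5, 4),
                  index=['Row1', 'Row2', 'Row3', 'Row4', 'Row5'],
                  columns=['Column1', 'Column2', 'Column3', 'Column4'])

# Drop the first column, then convert the rest to a NumPy array both ways.
arr_values = df.iloc[:, 1:].values
arr_numpy = df.iloc[:, 1:].to_numpy()

print(np.array_equal(arr_values, arr_numpy))  # True
print(arr_numpy.shape)                        # (5, 3)
```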

In [37]:
# Count the null values in each column
df1.isnull().sum()

Col1(0)    0
Col2(1)    0
Col3(2)    0
Col4(3)    0
dtype: int64
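
Because `df1` has no missing data, every count above is zero. The sketch below uses a small made-up frame (`gaps`, with hypothetical values) that does contain `NaN` entries, to show what a non-zero result looks like:

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing entries.
gaps = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                     'B': [np.nan, np.nan, 6.0],
                     'C': [7.0, 8.0, 9.0]})

# isnull() marks each missing cell as True; sum() counts them per column.
null_counts = gaps.isnull().sum()
print({col: int(n) for col, n in null_counts.items()})  # {'A': 1, 'B': 2, 'C': 0}

# Summing again gives the total number of missing cells.
print(int(null_counts.sum()))  # 3
```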

In [38]:
df1.head()

         Col1(0)  Col2(1)  Col3(2)  Col4(3)
Row1(0)        0        1        2        3
Row2(1)        4        5        6        7
Row3(2)        8        9       10       11
Row4(3)       12       13       14       15
Row5(4)       16       17       18       19


In [43]:
# Count how often each unique value occurs in a single column (a Series)
df1['Col1(0)'].value_counts()

Col1(0)
0     1
4     1
8     1
12    1
16    1
Name: count, dtype: int64

In [44]:
df['Column1'].value_counts()


Column1
0     1
4     1
8     1
12    1
16    1
Name: count, dtype: int64

In [45]:
df['Column1'].unique()

array([ 0,  4,  8, 12, 16])
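
Every value in `Column1` is distinct, so each count above is 1. A sketch with repeated values (the `colors` Series is made up for illustration) makes the behavior of `value_counts`, `unique`, and `nunique` easier to see:

```python
import pandas as pd

# Hypothetical column with repeated values.
colors = pd.Series(['red', 'blue', 'red', 'green', 'red', 'blue'])

# value_counts() sorts by frequency, most common first.
counts = colors.value_counts()
print(counts.index.tolist())  # ['red', 'blue', 'green']
print(counts.tolist())        # [3, 2, 1]

# unique() keeps the order of first appearance; nunique() just counts.
print(colors.unique().tolist())  # ['red', 'blue', 'green']
print(colors.nunique())          # 3
```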

In [46]:
df['Column1']

Row1     0
Row2     4
Row3     8
Row4    12
Row5    16
Name: Column1, dtype: int64

In [47]:
type(df['Column1'])

pandas.core.series.Series

In [50]:
df[['Column1','Column2']] # pass a list of column names to select several columns; the result is a DataFrame

      Column1  Column2
Row1        0        1
Row2        4        5
Row3        8        9
Row4       12       13
Row5       16       17


In [49]:
type(df[['Column1','Column2']])

pandas.core.frame.DataFrame
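
The bracket style decides the return type even for a single column. As a short sketch on the same `df` (rebuilt here so the cell runs on its own), single brackets give a one-dimensional Series and double brackets give a two-dimensional DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(0, 20).reshape(5, 4),
                  index=['Row1', 'Row2', 'Row3', 'Row4', 'Row5'],
                  columns=['Column1', 'Column2', 'Column3', 'Column4'])

one_col_series = df['Column1']    # a column name  -> Series
one_col_frame = df[['Column1']]   # a list of names -> DataFrame

print(one_col_series.shape, one_col_frame.shape)  # (5,) (5, 1)
```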