
# **Mastercard Interview Questions**

## **Q1. How will you find all the unique values in a column?**

Call `unique()` on the column to get its distinct values, in order of first appearance. The example below applies it to the iris dataset:

`df1['sepal length (cm)'].unique()`

In [2]:
import pandas as pd
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

df1 = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                   columns= iris['feature_names'] + ['target'])

"""
np.c_ concatenates arrays column-wise; it translates slice objects
to concatenation along the second axis.
Example:
np.c_[np.array([1,2,3]), np.array([4,5,6])]
array([[1, 4],
       [2, 5],
       [3, 6]])

"""


In [4]:
df1.columns

Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)', 'target'],
      dtype='object')

In [5]:
df1['sepal length (cm)'].unique()


array([5.1, 4.9, 4.7, 4.6, 5. , 5.4, 4.4, 4.8, 4.3, 5.8, 5.7, 5.2, 5.5,
       4.5, 5.3, 7. , 6.4, 6.9, 6.5, 6.3, 6.6, 5.9, 6. , 6.1, 5.6, 6.7,
       6.2, 6.8, 7.1, 7.6, 7.3, 7.2, 7.7, 7.4, 7.9])
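Related calls answer slightly different questions about distinct values. The sketch below is a minimal illustration on a small made-up frame (the `species` column and its values are hypothetical, not part of the iris frame above): `unique()` lists the values, `nunique()` counts them, and `value_counts()` reports how often each one occurs.

```python
import pandas as pd

# Small hypothetical frame used only for illustration
df = pd.DataFrame(
    {"species": ["setosa", "virginica", "setosa", None, "versicolor"]}
)

# unique() keeps first-seen order and includes the missing value
unique_values = df["species"].unique()

# nunique() counts distinct values and skips missing ones by default
n_distinct = df["species"].nunique()

# value_counts() reports how often each distinct (non-missing) value occurs
counts = df["species"].value_counts()

print(unique_values)
print(n_distinct)
print(counts)
```

If only the number of distinct values matters, `nunique()` avoids materialising the array of values.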

## **Q2. How will you rename a column?**

Use `rename()` with a mapping from old column names to new ones:

`df1.rename(columns={'sepal length (cm)' : 'sepal length'}, inplace=True)`

In [8]:
df1.rename(columns={'sepal length (cm)' : 'sepal length'}, inplace=True)
df1.columns

Index(['sepal length', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)', 'target'],
      dtype='object')
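As a hedged variation on the call above: without `inplace=True`, `rename()` returns a new DataFrame and leaves the original untouched, and the `columns` argument also accepts a function applied to every label. The frame below is a small hypothetical stand-in for the iris data.

```python
import pandas as pd

# Hypothetical two-row stand-in for the iris frame
df = pd.DataFrame(
    {
        "sepal length (cm)": [5.1, 4.9],
        "sepal width (cm)": [3.5, 3.0],
        "target": [0.0, 0.0],
    }
)

# Without inplace=True, rename() returns a new frame; df is unchanged
renamed = df.rename(
    columns={
        "sepal length (cm)": "sepal length",
        "sepal width (cm)": "sepal width",
    }
)

# A callable is applied to every column label; here it strips the unit suffix
stripped = df.rename(columns=lambda label: label.replace(" (cm)", ""))

print(list(df.columns))
print(list(renamed.columns))
print(list(stripped.columns))
```

Returning a new frame makes the call easy to chain with other operations, which is one reason to prefer it over `inplace=True`.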

## **Q3. What are various ways to combine two datasets?**

The **concat()** function in pandas is used to append either columns or rows from one DataFrame to another. The concat() function does all the heavy lifting of performing concatenation operations along an axis while performing optional set logic (union or intersection) of the indexes (if any) on the other axes.


pd.concat(
    objs,
    axis=0,
    join="outer",
    ignore_index=False,
    keys=None,
    levels=None,
    names=None,
    verify_integrity=False,
    copy=True,
)

In [16]:
# First DataFrame
df1 = pd.DataFrame({'id': ['A01', 'A02', 'A03', 'A04'],
                    'Name': ['ABC', 'PQR', 'DEF', 'GHI']},
                   index=[0, 1, 2, 3])
# Second DataFrame
df2 = pd.DataFrame({'id': ['B05', 'B06', 'B07', 'B08'],
                    'Name': ['XYZ', 'TUV', 'MNO', 'JKL']},
                   index=[4, 5, 6, 7])
  
frames = [df1, df2]
result = pd.concat(frames)
display(result)

|   | id | Name |
|---|---|---|
| 0 | A01 | ABC |
| 1 | A02 | PQR |
| 2 | A03 | DEF |
| 3 | A04 | GHI |
| 4 | B05 | XYZ |
| 5 | B06 | TUV |
| 6 | B07 | MNO |
| 7 | B08 | JKL |
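The row-wise example works cleanly because the two frames were given non-overlapping index labels. When both frames carry the default index, the `ignore_index` and `keys` parameters from the signature above control what happens to the labels. A minimal sketch, using hypothetical frames named `left` and `right`:

```python
import pandas as pd

# Both hypothetical frames carry the default index 0..1
left = pd.DataFrame({"id": ["A01", "A02"], "Name": ["ABC", "PQR"]})
right = pd.DataFrame({"id": ["B05", "B06"], "Name": ["XYZ", "TUV"]})

# A plain concat keeps the original labels, so 0 and 1 each appear twice
stacked = pd.concat([left, right])

# ignore_index=True discards the old labels and renumbers the rows 0..n-1
renumbered = pd.concat([left, right], ignore_index=True)

# keys= adds an outer index level recording which frame each row came from
labelled = pd.concat([left, right], keys=["first", "second"])

print(stacked)
print(renumbered)
print(labelled)
```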


In [15]:
## Joining
import pandas as pd
  
df1 = pd.DataFrame({'id': ['A01', 'A02', 'A03', 'A04'],
                    'Name': ['ABC', 'PQR', 'DEF', 'GHI']})
  
df3 = pd.DataFrame({'City': ['MUMBAI', 'PUNE', 'MUMBAI', 'DELHI'],
                    'Age': ['12', '13', '14', '12']})
  
# join='inner' keeps only the index labels present in both frames
# (the default, join='outer', keeps the union of the labels)
  
result = pd.concat([df1, df3], axis=1, join='inner')
display(result)

|   | id | Name | City | Age |
|---|---|---|---|---|
| 0 | A01 | ABC | MUMBAI | 12 |
| 1 | A02 | PQR | PUNE | 13 |
| 2 | A03 | DEF | MUMBAI | 14 |
| 3 | A04 | GHI | DELHI | 12 |
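In the joining example both frames share the index 0..3, so `join='inner'` and `join='outer'` give the same result. The difference only shows when the indexes differ. The sketch below uses hypothetical frames (`names`, `cities`, `ages`) to illustrate the union/intersection logic, and adds `pd.merge()` as another common way to combine two datasets, matching rows on a key column rather than on the index.

```python
import pandas as pd

# Hypothetical frames whose index labels only partly overlap
names = pd.DataFrame(
    {"id": ["A01", "A02", "A03"], "Name": ["ABC", "PQR", "DEF"]},
    index=[0, 1, 2],
)
cities = pd.DataFrame({"City": ["PUNE", "MUMBAI", "DELHI"]}, index=[1, 2, 3])

# Outer join (default): union of the labels, gaps filled with NaN
outer = pd.concat([names, cities], axis=1)

# Inner join: only the labels present in both frames survive
inner = pd.concat([names, cities], axis=1, join="inner")

# merge() aligns rows on the values of a key column instead of the index
ages = pd.DataFrame({"id": ["A02", "A03", "A09"], "Age": [13, 14, 15]})
merged = pd.merge(names, ages, on="id", how="inner")

print(outer)
print(inner)
print(merged)
```

As a rule of thumb, `concat()` suits stacking frames that already line up by position or index, while `merge()` suits database-style joins on shared key columns.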


## **Q4. How will you load a dataset which is too large in size to hold in memory?**

Reference: https://towardsdatascience.com/what-to-do-when-your-data-is-too-big-for-your-memory-65c84c600585

### Technique 1: Lossless Compression
#### Load data by columns


In [18]:
#Import needed library
import pandas as pd
#Dataset
csv = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
#Load entire dataset
data = pd.read_csv(csv)
data.info(verbose=False, memory_usage="deep")
print()
#Load only two columns
df_2col = pd.read_csv(csv , usecols=["county", "cases"])
df_2col.info(verbose=False, memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2502832 entries, 0 to 2502831
Columns: 6 entries, date to deaths
dtypes: float64(2), int64(1), object(3)
memory usage: 526.0 MB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2502832 entries, 0 to 2502831
Columns: 2 entries, county to cases
dtypes: int64(1), object(1)
memory usage: 172.3 MB


### Technique 2: Manipulate datatypes
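pandas reads integer columns as int64 by default, so choosing a narrower type saves memory whenever every value fits in its range:

int8 can store integers from -128 to 127.

int16 can store integers from -32768 to 32767.

int32 can store integers from -2147483648 to 2147483647.

int64 can store integers from -9223372036854775808 to 9223372036854775807.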

In [19]:
df = pd.read_csv(csv, usecols=["county", "cases"])
df["cases"].memory_usage(index=False, deep=True)

20022656

In [20]:
df = pd.read_csv(csv, usecols=["county", "cases"], dtype={"cases" : "int16"})
df["cases"].memory_usage(index=False, deep=True)

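## int16 uses 2 bytes per value instead of 8, hence the 4x reduction.
## Caution: a narrower dtype is only safe if every value fits in its range.
## The largest count printed by the chunking example in this notebook (2908425)
## exceeds the int16 maximum of 32767, so int32 is the safe choice for this column.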

5005664
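Before narrowing a column it helps to check that its values fit the target type. The following sketch assumes a small hypothetical `cases` series and uses `np.iinfo` to look up the ranges, then lets `pd.to_numeric` with `downcast="integer"` pick the smallest signed integer dtype that holds every value.

```python
import numpy as np
import pandas as pd

# Hypothetical counts; the last value is too large for int16
cases = pd.Series([12, 340, 2908425], dtype="int64")

# Print the representable range of each signed integer type
for int_type in (np.int8, np.int16, np.int32, np.int64):
    info = np.iinfo(int_type)
    print(int_type.__name__, info.min, info.max)

# Check the range explicitly before forcing a dtype
fits_int16 = cases.max() <= np.iinfo(np.int16).max

# Let pandas choose the smallest signed integer dtype that holds all values
downcast = pd.to_numeric(cases, downcast="integer")

print(fits_int16)
print(downcast.dtype)
print(cases.memory_usage(index=False), downcast.memory_usage(index=False))
```

Here the downcast lands on int32 rather than int16, halving the memory of the column without corrupting the largest value.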

### Technique 3: Chunking
Chunking means cutting a large dataset into smaller chunks and processing them one at a time, so that only one chunk is in memory at any moment. After all the chunks have been processed, you combine the per-chunk results to get the final answer.

In [21]:
import pandas as pd
# Dataset
csv = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
# Loop over the chunks and record the top county of each one
result = {}
for chunk in pd.read_csv(csv, chunksize=100):
    max_case = chunk["cases"].max()
    max_case_county = chunk.loc[chunk['cases'] == max_case, 'county'].iloc[0]
    # Keep the larger value if this county already topped an earlier chunk
    result[max_case_county] = max(max_case, result.get(max_case_county, max_case))
# Display the county with the overall maximum
top_county = max(result, key=result.get)
print(top_county, result[top_county])

Los Angeles 2908425
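The same pattern can be tried without downloading anything. The sketch below is a minimal illustration that reads a tiny hypothetical CSV from memory with `chunksize=2` and combines two kinds of per-chunk results: a running total (sums add up across chunks) and a running maximum (the overall maximum is the maximum of the chunk maxima).

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file that is too large to load at once
csv_text = """county,state,cases
Cook,Illinois,10
Los Angeles,California,50
Harris,Texas,30
King,Washington,20
Los Angeles,California,70
Cook,Illinois,40
"""

total_cases = 0
best_county, best_cases = None, -1

# Each iteration yields a DataFrame holding at most two rows
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # Partial sums can simply be added together
    total_cases += int(chunk["cases"].sum())
    # Row with the largest count inside this chunk
    row = chunk.loc[chunk["cases"].idxmax()]
    # Keep it only if it beats the best row seen so far
    if row["cases"] > best_cases:
        best_county, best_cases = row["county"], int(row["cases"])

print(total_cases, best_county, best_cases)
```

Not every statistic combines this easily: a median, for instance, cannot be computed from per-chunk medians alone.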


### Technique 4: Indexing
The chunking code above took about 2 minutes to execute.

Chunking is excellent if you need to pass over your dataset only once, but if you want to load several different subsets of it, then indexing is the way to go.

For example, let’s say I want to get the cases for a specific state. With chunking alone, I could write a simple function that accomplishes that.

In [22]:
def get_state_info(name):
  csv = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
  return pd.concat(df[df['state']==name] for df in 
                   pd.read_csv(csv, chunksize=100)
                   )

So, my small function reads every row of every chunk but keeps only the rows for the state I want. **That leads to significant overhead**, because the whole file is scanned on every call. I can avoid this by using a database alongside pandas. The simplest one I can use is **SQLite.**

I first load the data into SQLite and then rewrite the get_state_info function.

In [None]:
## I first need to load my data frame into an SQLite database.
import sqlite3
csv = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
# Create a new database file:
db = sqlite3.connect("cases.sqlite")
# Load the CSV in chunks:
for c in pd.read_csv(csv, chunksize=100):
    # Append all rows to a new database table
    c.to_sql("cases", db, if_exists="append")
# Add an index on the 'state' column:
db.execute("CREATE INDEX state ON cases(state)") 
db.close()

In [None]:
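def get_state_info(name):
  connection = sqlite3.connect("cases.sqlite")
  query = "select * from cases where state= ?"
  # params must be a sequence, so wrap the single value in a one-element tuple
  values = (name,)
  try:
    # Pass the values by keyword: the third positional argument is index_col
    return pd.read_sql_query(query, connection, params=values)
  finally:
    connection.close()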
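To see the index at work without the full dataset, here is a minimal sketch that uses an in-memory SQLite database and a small hypothetical frame. The index name `idx_cases_state` and the sample rows are assumptions made for illustration. `EXPLAIN QUERY PLAN` is SQLite's way of reporting how it intends to run a query, so the plan text should mention the index when it is used.

```python
import sqlite3

import pandas as pd

# Hypothetical sample rows standing in for the full cases table
frame = pd.DataFrame(
    {
        "county": ["Cook", "Los Angeles", "Harris", "San Diego"],
        "state": ["Illinois", "California", "Texas", "California"],
        "cases": [40, 70, 30, 25],
    }
)

# ":memory:" creates a throwaway database that lives only in RAM
connection = sqlite3.connect(":memory:")
frame.to_sql("cases", connection, index=False)
connection.execute("CREATE INDEX idx_cases_state ON cases(state)")

# Parameterised query: the value is passed separately from the SQL text
query = "SELECT * FROM cases WHERE state = ?"
california = pd.read_sql_query(query, connection, params=("California",))

# Ask SQLite how it plans to execute the query
plan = connection.execute("EXPLAIN QUERY PLAN " + query, ("California",)).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)
connection.close()

print(california)
print(plan_text)
```

With the index in place, SQLite can jump to the matching rows instead of scanning the whole table, which is what makes repeated per-state lookups cheap.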