<img src=https://i.ibb.co/6gCsHd6/1200px-Pandas-logo-svg.png width="700" height="200">


## <p style="background-color:#FDFEFE; font-family:newtimeroman; color:#4d77cf; font-size:200%; text-align:center; border-radius:10px 10px;">Pandas DataFrames</p>

<a id="toc"></a>

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Content</p>

* [IMPORTING LIBRARIES NEEDED IN THIS NOTEBOOK](#0)
* [DATA FRAMES](#1)
* [CREATING A DATA FRAME](#2)
    * [Creating a DataFrame Using the Lists of Data & Columns](#2.1)
    * [Creating a DataFrame Using NumPy Arrays](#2.2)
    * [Creating a DataFrame Using a Dictionary](#2.3)
    * [The Examination of Some Attributes on Data](#2.4)
* [INDEXING, SLICING & SELECTION](#3)    
* [CREATING A NEW COLUMN](#4)    
* [REMOVING COLUMNS](#5)
* [REMOVING ROWS](#6)
* [SELECTING ROWS & COLUMNS USING .loc[ ] & .iloc[ ] ](#7)
* [CONDITIONAL SELECTION](#8)
    * [One Conditional Statement](#8.1)
    * [Two or More Conditional Statements](#8.2)
    * [Conditional Selection Using .loc[ ]](#8.3)
* [reset_index() & set_index()](#9)
* [Multi-Index & Index Hierarchy](#10)
* [Some Other Useful Methods with Iris Dataset](#11)
* [THE END OF THE SESSION-04](#12)

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Importing Libraries Needed in This Notebook</p>

<a id="0"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

Once you've installed NumPy & Pandas, you can import them as libraries:

In [252]:
import pandas as pd
import numpy as np

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Data Frames</p>

<a id="1"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

A DataFrame is a two-dimensional data container, similar to a Matrix, but which can contain heterogeneous data, and for which symbolic names may be associated with the rows and columns. ``DataFrames`` are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. 
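To make the "Series objects sharing the same index" picture concrete, here is a minimal sketch. The names `height`, `weight` and `people`, and the values in them, are illustrative assumptions and not part of the notebook.

```python
import pandas as pd

# Two Series with the same index labels (hypothetical data)
height = pd.Series([170, 182, 165], index=["ann", "bob", "cid"], name="height")
weight = pd.Series([60.5, 81.0, 55.2], index=["ann", "bob", "cid"], name="weight")

# Placing them side by side gives a DataFrame whose columns are the Series
people = pd.concat([height, weight], axis=1)

# Each column of the DataFrame is again a Series carrying the shared index
print(type(people["height"]))
print(people.index.tolist())
```

Note that the two columns keep their own dtypes (integer and float), which is the "heterogeneous data" property mentioned above.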

### Why use Pandas?

Data scientists make use of Pandas in Python for its **following advantages**:

- Easily handles missing data
- It uses Series as its one-dimensional data structure and DataFrame as its two-dimensional data structure
- It provides an efficient way to slice the data
- It provides a flexible way to merge, concatenate or reshape the data
- It includes a powerful time series tool to work with

In a nutshell, Pandas is a useful library for data manipulation and analysis. It provides powerful and easy-to-use data structures, as well as the means to quickly perform operations on these structures.

[SOURCE01](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html), 
[SOURCE02](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html), 
[SOURCE03](https://morioh.com/p/2528ac775b1b), 
[SOURCE04](https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python), 
[SOURCE05](https://www.guru99.com/python-pandas-tutorial.html), 
[SOURCE06](https://www.tutorialspoint.com/python_pandas/python_pandas_dataframe.htm), 
[SOURCE07](https://realpython.com/pandas-dataframe/) &
[SOURCE08](https://towardsdatascience.com/a-simple-guide-to-pandas-dataframes-b125f64e1453)<br>
[VIDEO SOURCE01](https://www.youtube.com/watch?v=zmdjNSmRXF4), 
[VIDEO SOURCE02](https://www.youtube.com/watch?v=F6kmIpWWEdU) &
[VIDEO SOURCE03](https://towardsdatascience.com/pandas-dataframe-basics-3c16eb35c4f3)<br>

**Now let's use pandas to explore this topic!**

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Creating a DataFrame</p>

<a id="2"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

A **``DataFrame``** is a **two-dimensional collection of data**. It is a data structure where data is stored in **tabular form**, arranged in rows and columns, and a single DataFrame can hold several columns of data. We can perform various operations on it, such as selecting rows and columns, adding or removing rows and columns, and doing arithmetic on columns.

We can load a DataFrame from external storage such as a SQL database, a CSV file, or an Excel file. We can also build one from in-memory Python objects such as lists, a dictionary, or a list of dictionaries.

In this session, we will learn to create the DataFrame in multiple ways. Let's understand these different ways.

**``pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)``**
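As a rough illustration of how the constructor parameters in the signature above fit together, the sketch below passes `data`, `index`, `columns` and `dtype` explicitly. The variable names and values are made up for this example.

```python
import pandas as pd

# data: a nested list, one inner list per row (hypothetical values)
rows = [[1, 2.5], [3, 4.5], [5, 6.5]]

frame = pd.DataFrame(
    data=rows,
    index=["r1", "r2", "r3"],  # one label per row
    columns=["a", "b"],        # one label per column
    dtype="float64",           # force a single dtype for all columns
)

print(frame)
print(frame.dtypes)
```

If `index` or `columns` is left out, pandas falls back to the default integer labels 0, 1, 2, ... as the first examples in the next section show.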

### <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:LEFT; border-radius:10px 10px;">Creating a DataFrame Using the Lists of Data & Columns</p>

<a id="2.1"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

In [184]:
data = [1, 3, 5, 7, 9, 18]
pd.DataFrame(data)

Unnamed: 0,0
0,1
1,3
2,5
3,7
4,9
5,18


In [185]:
pd.DataFrame(data, columns=['column1'])

# Creating a DataFrame by giving a column name from the list 
# (As many columns as there are, as many column names should be entered.)

Unnamed: 0,column1
0,1
1,3
2,5
3,7
4,9
5,18


In [186]:
# Let us remember how we define the name of a Series

pd.Series(data=data, name="column_1")

0     1
1     3
2     5
3     7
4     9
5    18
Name: column_1, dtype: int64

In [187]:
pd.DataFrame(data=data, index=["A", "B", "C", "D", "E", "F"], columns=["Colmn1"]) 

# Creating a DataFrame by giving index and column names from the list 
# (We should enter as many index names as we have rows in dataframe)

Unnamed: 0,Colmn1
A,1
B,3
C,5
D,7
E,9
F,18


### <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:LEFT; border-radius:10px 10px;">Creating a DataFrame Using NumPy Arrays</p>

<a id="2.2"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

In [77]:
my_data = np.arange(1,24,2).reshape(3,4)

# pass the column names as a flat list; wrapping the list in another list would create a MultiIndex
pd.DataFrame(data=my_data, columns="var1 var2 var3 var4".split())

Unnamed: 0,var1,var2,var3,var4
0,1,3,5,7
1,9,11,13,15
2,17,19,21,23


### <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:LEFT; border-radius:10px 10px;">Creating a DataFrame Using a Dictionary</p>

<a id="2.3"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

In [76]:
# let's build a DataFrame from a dict; the keys become the column names
s1 = np.random.randint(1,10, size=4)
s2 = np.random.randint(3,10, size=4)
s3 = np.random.randint(4,15, size=4)

myDict = {"var1": s1, "var2": s2, "var3": s3}
myDict

{'var1': array([9, 7, 4, 8]),
 'var2': array([9, 9, 8, 9]),
 'var3': array([13,  6,  6, 14])}

In [78]:
df = pd.DataFrame(myDict)
df

Unnamed: 0,var1,var2,var3
0,9,9,13
1,7,9,6
2,4,8,6
3,8,9,14


In [79]:
df7 = pd.DataFrame({ "First Name": ["Nik", "Evan"],
                    "Last Initial": ["P", "D"],
                    "Age": [30,31],
                    "Email": ["nik@email.com", "evan@email.com"],
                    "Weight": [175,170]
                   })
df7

Unnamed: 0,First Name,Last Initial,Age,Email,Weight
0,Nik,P,30,nik@email.com,175
1,Evan,D,31,evan@email.com,170


In [80]:
df.columns  # getting the column names is important; we will use it later. Note that this is not a list,
# but it is iterable, so we can use it in for loops and similar constructs

Index(['var1', 'var2', 'var3'], dtype='object')

### <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:LEFT; border-radius:10px 10px;">Some Attributes on Data</p>

<a id="2.4"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

# .read_csv(): 

Read a comma-separated values (csv) file into DataFrame

dataframe=pd.read_csv('my_file.csv')

In [None]:
# df = pd.read_csv("titanic.csv")
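Since no CSV file ships with this notebook, the following sketch reads CSV text from an in-memory buffer instead of a file on disk. The content of `csv_text` is invented for the example; `pd.read_csv` accepts a path or a file-like object in the same way.

```python
import io
import pandas as pd

# Stand-in for a file on disk (hypothetical content)
csv_text = "name,age\nAda,36\nLinus,28\n"

# The first line is used as the header row by default
people_csv = pd.read_csv(io.StringIO(csv_text))
print(people_csv)
```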

# .head(): 

This function returns the first n rows for the object based on position, default n=5

In [81]:
import numpy as np
df11 = pd.DataFrame(np.arange(1,24,2).reshape(3,4),
                  columns = ["var1", "var2", "var3", "var4"])
df11

Unnamed: 0,var1,var2,var3,var4
0,1,3,5,7
1,9,11,13,15
2,17,19,21,23


In [82]:
df11.head(2)

Unnamed: 0,var1,var2,var3,var4
0,1,3,5,7
1,9,11,13,15


# .tail(): 

This function returns last n rows from the object based on position, default n=5.

In [84]:
df11.tail(1)

Unnamed: 0,var1,var2,var3,var4
2,17,19,21,23


# .sample : 

Return a random sample of items from an axis of object

In [85]:
df11.sample()

Unnamed: 0,var1,var2,var3,var4
2,17,19,21,23


In [86]:
df11.sample(1, axis = 1)

Unnamed: 0,var1
0,1
1,9
2,17


In [87]:
df12 = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                    'num_wings': [2, 0, 0, 0],
                    'num_specimen_seen': [10, 2, 1, 8]},
                   index=['falcon', 'dog', 'spider', 'fish'])
df12

Unnamed: 0,num_legs,num_wings,num_specimen_seen
falcon,2,2,10
dog,4,0,2
spider,8,0,1
fish,0,0,8


In [89]:
df12.sample()

Unnamed: 0,num_legs,num_wings,num_specimen_seen
fish,0,0,8


In [93]:
# Extract 3 random rows from the single-column DataFrame ``df12[['num_legs']]``:
# Note that we use `random_state` to ensure the reproducibility of the examples.
df12[['num_legs']].sample(n=3, random_state=1)

Unnamed: 0,num_legs
fish,0
spider,8
falcon,2


# df.describe():

Generate descriptive statistics which include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values

In [197]:
df12.describe()

Unnamed: 0,num_legs,num_wings,num_specimen_seen
count,4.0,4.0,4.0
mean,3.5,0.5,5.25
std,3.41565,1.0,4.425306
min,0.0,0.0,1.0
25%,1.5,0.0,1.75
50%,3.0,0.0,5.0
75%,5.0,0.5,8.5
max,8.0,2.0,10.0


In [198]:
df12["num_legs"].describe()

count    4.00000
mean     3.50000
std      3.41565
min      0.00000
25%      1.50000
50%      3.00000
75%      5.00000
max      8.00000
Name: num_legs, dtype: float64

# .shape: 

Return a tuple representing the dimensionality of the DataFrame.

In [94]:
df11.shape

(3, 4)

In [95]:
df12.shape

(4, 3)

# df.mean(): 

Return the mean of the values over the requested axis.

In [195]:
# average age of the females (data5 is created in the df.col.sum() cell further below)
data5["age"].loc[data5["gender"] == "F"].mean()

17.333333333333332

In [196]:
# average age of the males living in Delhi
data5["age"].loc[data5["city"] == "Delhi"].loc[data5["gender"] == "M"].mean()

16.0
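The chained `.loc` calls above can also be written with a single boolean mask. The sketch below is one possible way to do it; `students` simply repeats three columns of `data5` so that the block runs on its own.

```python
import pandas as pd

students = pd.DataFrame({
    "age": [10, 22, 13, 21, 12, 11, 17],
    "city": ["Gurgaon", "Delhi", "Mumbai", "Delhi", "Mumbai", "Delhi", "Mumbai"],
    "gender": ["M", "F", "F", "M", "M", "M", "F"],
})

# Build one boolean mask; each condition needs its own parentheses
mask = (students["city"] == "Delhi") & (students["gender"] == "M")

# Select the matching rows and the "age" column in a single .loc call
delhi_male_mean = students.loc[mask, "age"].mean()
print(delhi_male_mean)
```

Keeping the row filter and the column selection in one `.loc[rows, column]` call avoids the intermediate objects created by chaining.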

**4) Dealing with unique values in a column using;**

# 4.1 unique() :

Compute array of unique values in a Series, returned in the order observed

In [2]:
import pandas as pd
pd.Series([2, 1, 3, 3], name='A').unique()

array([2, 1, 3])

In [4]:
df48 = pd.DataFrame({"Company": ["FB", "GGOG", "MSFT", "GOOG", "MSFT", "FB"], 
                    "Person": "Edward Charlie Amy Vanessa Carl Sarah".split(),
                    "Sales": [200,120,340,124,243,350]})
df48

Unnamed: 0,Company,Person,Sales
0,FB,Edward,200
1,GGOG,Charlie,120
2,MSFT,Amy,340
3,GOOG,Vanessa,124
4,MSFT,Carl,243
5,FB,Sarah,350


In [6]:
df48["Company"].unique()

array(['FB', 'GGOG', 'MSFT', 'GOOG'], dtype=object)

# 4.2 nunique(): 

gives the number of unique values

In [7]:
df48["Person"].nunique()

6

In [8]:
df48.nunique()

Company    4
Person     6
Sales      6
dtype: int64

# 5. value_counts() :

Return a Series containing unique values as its index and frequencies as its values, sorted by count in descending order.

In [9]:
df48[["Company"]].value_counts()

Company
FB         2
MSFT       2
GGOG       1
GOOG       1
dtype: int64

# 6. sort_values() 

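This method sorts by values. On a DataFrame it takes a `by` argument with the column name (or a list of names) to sort by; on a Series, as in the examples below, no `by` argument is needed.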

In [10]:
df48.Company.sort_values()

0      FB
5      FB
1    GGOG
3    GOOG
2    MSFT
4    MSFT
Name: Company, dtype: object

In [11]:
df48.Sales.sort_values()
# for a permanent sort: inplace=True
# to renumber the index after sorting: ignore_index=True

1    120
3    124
0    200
4    243
2    340
5    350
Name: Sales, dtype: int64
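The cells above sort a single Series. As a complement, here is a small sketch of `DataFrame.sort_values` with the `by` argument; the `sales` frame is a made-up stand-in, not `df48` itself.

```python
import pandas as pd

sales = pd.DataFrame({
    "Company": ["FB", "GOOG", "MSFT", "GOOG", "MSFT", "FB"],
    "Sales": [200, 120, 340, 124, 243, 350],
})

# Sort the whole frame by one column, largest first;
# ignore_index=True renumbers the rows 0..n-1 afterwards
by_sales = sales.sort_values(by="Sales", ascending=False, ignore_index=True)

# Sort by several columns: Company A to Z, then Sales high to low within each company
by_both = sales.sort_values(by=["Company", "Sales"], ascending=[True, False])

print(by_sales)
print(by_both)
```

Both calls return new DataFrames; `sales` keeps its original order.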

# str.split, str.rsplit

In [1]:
# hotels.name.str.rsplit(n=1, expand= True).iloc[:,1].value_counts().head()

# hotels["phone-number"].str.split("-", n=1, expand= True).iloc[:,0].value_counts().head(3)
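The `hotels` DataFrame used in the commented-out lines is not available in this notebook, so the sketch below uses two small, invented Series (`names` and `phones`) to show the same pattern of splitting, expanding into columns and counting.

```python
import pandas as pd

# Hypothetical data standing in for the hotels columns
names = pd.Series(["Grand Plaza Hotel", "Sea View Inn", "Old Town Hotel"])
phones = pd.Series(["212-555-0101", "212-555-0199", "415-555-0123"])

# rsplit splits from the right; n=1 makes a single split,
# so column 1 of the expanded result holds the last word
last_word = names.str.rsplit(n=1, expand=True).iloc[:, 1]

# split works from the left; n=1 keeps everything after the first dash together,
# so column 0 holds the part before the first dash
area_code = phones.str.split("-", n=1, expand=True).iloc[:, 0]

print(last_word.value_counts())
print(area_code.value_counts())
```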

![image.png](attachment:30a280e1-cecf-4821-b74c-0ddb95e0e01e.png)

![image.png](attachment:cb362632-d63e-471f-bf50-044e40d95af9.png)

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Indexing, Slicing & Selection</p>

<a id="3"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

Let's learn a variety of methods to grab data from a DataFrame

In [12]:
from numpy.random import randn

In [98]:
# creating a DataFrame by "keyword arguments"

np.random.seed(101)
df = pd.DataFrame(randn(5, 4), index = ['A', 'B', 'C', 'D', 'E'], columns = ['W', 'X', 'Y', 'Z'])
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


# df.index : 

The basic object storing axis labels for all pandas objects.

In [99]:
df.index

Index(['A', 'B', 'C', 'D', 'E'], dtype='object')

In [100]:
# plain df["Y"] returns a Series. If we want a DataFrame instead, use [[]]
df[["Y"]]
# to select more than one column, pass the names in a list: df[["Y", "W"]]. More than one column cannot be a Series, so the result is a DataFrame

Unnamed: 0,Y
A,0.907969
B,-0.848077
C,0.528813
D,-0.933237
E,2.605967


In [101]:
# changing the index labels (pass a flat list; a list inside a list would create a MultiIndex)
df.index = "z y g k m".split()
df

Unnamed: 0,W,X,Y,Z
z,2.70685,0.628133,0.907969,0.503826
y,0.651118,-0.319318,-0.848077,0.605965
g,-2.018168,0.740122,0.528813,-0.589001
k,0.188695,-0.758872,-0.933237,0.955057
m,0.190794,1.978757,2.605967,0.683509


# df.columns : 

The column labels of the DataFrame.

In [103]:
df.columns

Index(['W', 'X', 'Y', 'Z'], dtype='object')

In [108]:
# renaming the columns
df = pd.DataFrame(myDict)
df

Unnamed: 0,var1,var2,var3
0,9,9,13
1,7,9,6
2,4,8,6
3,8,9,14


In [109]:
# we can assign the new names as a tuple
df.columns = "new1", "new2", "new3"
df

Unnamed: 0,new1,new2,new3
0,9,9,13
1,7,9,6
2,4,8,6
3,8,9,14


In [110]:
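# or rename specific labels with rename(); it returns a new DataFrame, so df itself stays unchanged unless we reassign it or pass inplace=True
df.rename(index={0: "yeni", 1:"yeni2"})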
df

Unnamed: 0,new1,new2,new3
0,9,9,13
1,7,9,6
2,4,8,6
3,8,9,14
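The output above is unchanged because `rename` returns a new object rather than modifying the frame it is called on. A minimal sketch of that behaviour, with a made-up `scores` frame:

```python
import pandas as pd

scores = pd.DataFrame({"new1": [9, 7], "new2": [9, 9]})

# rename returns a new DataFrame; the original keeps its labels
renamed = scores.rename(index={0: "first", 1: "second"})

print(scores.index.tolist())   # original labels
print(renamed.index.tolist())  # renamed labels
```

To keep the new labels, assign the result to a name (as done here with `renamed`) or pass `inplace=True`.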


In [261]:
# how to change column names
people = {
    "first": ["Corey", 'Jane', 'John'], 
    "last": ["Schafer", 'Doe', 'Doe'], 
    "email": ["CoreyMSchafer@gmail.com", 'JaneDoe@email.com', 'JohnDoe@email.com']
}
df23 = pd.DataFrame(people)
print(df23.columns)

df23.columns = [i.upper() for i in df23.columns]
print(df23.columns)
print("-----"*15)
df23.columns = [i.replace("S", "+") for i in df23.columns]
print(df23.columns)

Index(['first', 'last', 'email'], dtype='object')
Index(['FIRST', 'LAST', 'EMAIL'], dtype='object')
---------------------------------------------------------------------------
Index(['FIR+T', 'LA+T', 'EMAIL'], dtype='object')


In [262]:
df23

Unnamed: 0,FIR+T,LA+T,EMAIL
0,Corey,Schafer,CoreyMSchafer@gmail.com
1,Jane,Doe,JaneDoe@email.com
2,John,Doe,JohnDoe@email.com


In [264]:
# so far we've modified all the columns together; here is how to change them individually

df23.rename(columns= {"FIR+T" : "first", "LA+T" : "last"}, inplace = True)
df23

Unnamed: 0,first,last,EMAIL
0,Corey,Schafer,CoreyMSchafer@gmail.com
1,Jane,Doe,JaneDoe@email.com
2,John,Doe,JohnDoe@email.com


# df.col.sum(): 

Return the sum of the values over the requested axis.

In [191]:
data5 = pd.DataFrame({
    'age' :     [ 10, 22, 13, 21, 12, 11, 17],
    'section' : [ 'A', 'B', 'C', 'B', 'B', 'A', 'A'],
    'city' :    [ 'Gurgaon', 'Delhi', 'Mumbai', 'Delhi', 'Mumbai', 'Delhi', 'Mumbai'],
    'gender' :  [ 'M', 'F', 'F', 'M', 'M', 'M', 'F'],
    'favourite_color' : [ 'red', 'blue', 'yellow', 'pink', 'black', 'green', 'red']
})

data5

Unnamed: 0,age,section,city,gender,favourite_color
0,10,A,Gurgaon,M,red
1,22,B,Delhi,F,blue
2,13,C,Mumbai,F,yellow
3,21,B,Delhi,M,pink
4,12,B,Mumbai,M,black
5,11,A,Delhi,M,green
6,17,A,Mumbai,F,red


In [192]:
# sum of the ages of the males living in Delhi

data5["age"].loc[(data5["city"]=="Delhi") & (data5["gender"]== "M")].sum()

32

In [193]:
# df.col.unique():
# Hash table-based unique. Uniques are returned in order of appearance. This does NOT sort
data5["city"].unique()
# The result is a NumPy array with dtype object. To turn it into a list, use .tolist(), as in the next cell


array(['Gurgaon', 'Delhi', 'Mumbai'], dtype=object)

In [194]:
print(data5["city"].unique().tolist())

# we can also assign it to a variable
unique_cities = data5["city"].unique().tolist()
unique_cities

['Gurgaon', 'Delhi', 'Mumbai']


['Gurgaon', 'Delhi', 'Mumbai']

# df.shape : 

Return a tuple representing the dimensionality of the DataFrame.

In [111]:
df.shape

(4, 3)

In [112]:
print(df.shape[0])  # row numbers
print(df.shape[1])  # column numbers

4
3


In [113]:
len(df)  # len() of a DataFrame is always its number of rows

4

# df.size : 

Return an int representing the number of elements in this object.

In [115]:
df.size

12

# df.ndim : 

Return an int representing the number of axes / array dimensions.

In [116]:
df12.ndim

2

**in operator for series and df**

In [118]:
np.random.seed(101)
df = pd.DataFrame(randn(5, 4), index = ['A', 'B', 'C', 'D', 'E'], columns = ['W', 'X', 'Y', 'Z'])
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [119]:
# for a Series, the in operator only checks the index; for a DataFrame the columns are the main axis, so it searches the column labels
print("W" in df)
"C" in df  

True


False

In [120]:
# but we can find it by searching the index directly
"C" in df.index

True

In [121]:
"C" in df.W  # we can also check whether a label exists in the index of a column

True
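Because `in` looks at labels, it says nothing about the stored values. The sketch below contrasts the two and shows two common ways to test values instead; the `letters` Series is invented for the example.

```python
import pandas as pd

letters = pd.Series(["x", "y", "z"], index=["A", "B", "C"])

label_hit = "A" in letters              # "A" is an index label
value_via_in = "x" in letters           # `in` does not look at the values
value_hit = "x" in letters.values       # search the underlying NumPy array
value_mask = letters.isin(["x", "z"])   # element-wise membership test

print(label_hit, value_via_in, value_hit)
print(value_mask)
```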

# df.isnull().sum():

Detect missing values. Return a total number of missing values in each column.

In [188]:
data6 = pd.DataFrame({
    'age' :     [ 10, 22, np.nan, 21, np.nan, np.nan, 17],  # three values set to NaN (np.nan; the np.NaN alias was removed in NumPy 2.0)
    'section' : [ 'A', 'B', 'C', 'B', 'B', 'A', 'A'],
    'city' :    [ 'Gurgaon', 'Delhi', 'Mumbai', 'Delhi', 'Mumbai', 'Delhi', 'Mumbai'],
    'gender' :  [ 'M', 'F', 'F', 'M', 'M', 'M', 'F'],
    'favourite_color' : [ 'red', 'blue', 'yellow', 'pink', 'black', 'green', 'red']
})

data6

Unnamed: 0,age,section,city,gender,favourite_color
0,10.0,A,Gurgaon,M,red
1,22.0,B,Delhi,F,blue
2,,C,Mumbai,F,yellow
3,21.0,B,Delhi,M,pink
4,,B,Mumbai,M,black
5,,A,Delhi,M,green
6,17.0,A,Mumbai,F,red


In [189]:
data6.isnull()  # shows missing values as True. To see the count per column, append .sum()

Unnamed: 0,age,section,city,gender,favourite_color
0,False,False,False,False,False
1,False,False,False,False,False
2,True,False,False,False,False
3,False,False,False,False,False
4,True,False,False,False,False
5,True,False,False,False,False
6,False,False,False,False,False


In [190]:
data6.isnull().sum()

age                3
section            0
city               0
gender             0
favourite_color    0
dtype: int64

# df.col.value_counts(): 

Return a Series containing counts of unique values.

Use it to display frequencies. To see relative frequencies (proportions) instead of counts, use `value_counts(normalize=True)`.

In [199]:
data6.value_counts()

age   section  city     gender  favourite_color
10.0  A        Gurgaon  M       red                1
17.0  A        Mumbai   F       red                1
21.0  B        Delhi    M       pink               1
22.0  B        Delhi    F       blue               1
dtype: int64

In [200]:
data6.city.value_counts()

Delhi      3
Mumbai     3
Gurgaon    1
Name: city, dtype: int64

In [201]:
# this could also be written as:

data6["city"].value_counts()

Delhi      3
Mumbai     3
Gurgaon    1
Name: city, dtype: int64

In [202]:
data6["age"].value_counts(normalize = True)

10.0    0.25
22.0    0.25
21.0    0.25
17.0    0.25
Name: age, dtype: float64

In [203]:
index4 = pd.Index([3, 1, 2, 3, 4, np.nan])
print(index4.value_counts())
print()
print(index4.value_counts(normalize = True))
print()
print(index4.value_counts(dropna = False))  # dropna defaults to True; setting it to False includes the NaN values as well

3.0    2
1.0    1
2.0    1
4.0    1
dtype: int64

3.0    0.4
1.0    0.2
2.0    0.2
4.0    0.2
dtype: float64

3.0    2
1.0    1
2.0    1
4.0    1
NaN    1
dtype: int64


## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Creating a New Column</p>

<a id="4"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

In [226]:
# creating a new column
np.random.seed(101)

df = pd.DataFrame(np.random.randint(9, size=(5,4)),
                  index = "A B C D E".split(),
                  columns = "W X Y Z".split())
df

Unnamed: 0,W,X,Y,Z
A,1,6,7,8
B,4,8,5,0
C,5,8,1,3
D,8,3,3,2
E,8,3,7,0


In [123]:
df["new1"] = df["X"] * df["Y"]

df
# for example, this kind of multiplication can convert a whole column of US dollar values using the current Turkish lira exchange rate

Unnamed: 0,W,X,Y,Z,new1
A,1,6,7,8,42
B,4,8,5,0,40
C,5,8,1,3,8
D,8,3,3,2,9
E,8,3,7,0,21


In [124]:
# or create a column from scratch
df["new2"] = np.arange(5)  # the number of values must equal len(df)
df

Unnamed: 0,W,X,Y,Z,new1,new2
A,1,6,7,8,42,0
B,4,8,5,0,40,1
C,5,8,1,3,8,2
D,8,3,3,2,9,3
E,8,3,7,0,21,4


# df.reset_index : 

Reset the index of the DataFrame, and use the default one instead. If the DataFrame has a MultiIndex, this method can remove one or more levels.

In [125]:
# reset to default 0,1,2,...n

df.reset_index()
# inplace defaults to False; setting it to True makes the change permanent

Unnamed: 0,index,W,X,Y,Z,new1,new2
0,A,1,6,7,8,42,0
1,B,4,8,5,0,40,1
2,C,5,8,1,3,8,2
3,D,8,3,3,2,9,3
4,E,8,3,7,0,21,4


In [126]:
df.index  # hasn't changed

Index(['A', 'B', 'C', 'D', 'E'], dtype='object')
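As a small sketch of the options around `reset_index`, the block below compares the default behaviour with `drop=True`. The `small` frame is a made-up example.

```python
import pandas as pd

small = pd.DataFrame({"W": [1, 4, 5]}, index=["A", "B", "C"])

# Default: the old labels move into a new column named "index"
kept = small.reset_index()

# drop=True: the old labels are discarded instead of being kept as a column
dropped = small.reset_index(drop=True)

print(kept)
print(dropped)
print(small.index.tolist())  # the original frame is untouched
```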

# df.set_index(): 

Set the DataFrame index using existing columns.

In [127]:
newindex = "CA NY WY OR CO".split()
newindex

['CA', 'NY', 'WY', 'OR', 'CO']

In [128]:
df["newidx"] = newindex
df

Unnamed: 0,W,X,Y,Z,new1,new2,newidx
A,1,6,7,8,42,0,CA
B,4,8,5,0,40,1,NY
C,5,8,1,3,8,2,WY
D,8,3,3,2,9,3,OR
E,8,3,7,0,21,4,CO


In [129]:
df.set_index("newidx")

newidx,W,X,Y,Z,new1,new2
CA,1,6,7,8,42,0
NY,4,8,5,0,40,1
WY,5,8,1,3,8,2
OR,8,3,3,2,9,3
CO,8,3,7,0,21,4


# How to check for NaN values in a column or in the entire df: isnull and hasnans

In [227]:
df.empty
# True if Series/DataFrame is entirely empty (no items), meaning any of the axes are of length 0.

False

In [228]:
df.Z.hasnans
# is there any NaN in column Z?

False

In [230]:
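# for a DataFrame, use isnull() rather than hasnans
df.isnull().any()
# without any(), a large DataFrame (say 16,000 rows) prints True/False for every cell, which is impossible to scan;
# any() reduces that to one flag per column, and sum() gives the number of missing values per column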

W    False
X    False
Y    False
Z    False
dtype: bool

In [232]:
df.isnull().sum()

W    0
X    0
Y    0
Z    0
dtype: int64
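The `df` checked above contains no missing values, so every result is `False` or `0`. For contrast, here is a brief sketch on a made-up frame, `gaps`, that does contain one NaN.

```python
import numpy as np
import pandas as pd

gaps = pd.DataFrame({"W": [1.0, np.nan, 3.0], "X": [4.0, 5.0, 6.0]})

print(gaps["W"].hasnans)          # Series-level check for a single column
print(gaps.isnull().any())        # per column: is there at least one NaN?
print(gaps.isnull().any().any())  # whole frame: is there any NaN at all?
print(gaps.isnull().sum())        # per column: how many NaN values?
```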

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Removing Columns</p>

<a id="5"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

# df.drop()

DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')

The drop() method removes the specified row or column. By specifying the column axis ( axis='columns' ), the drop() method removes the specified column. By specifying the row axis ( axis='index' ), the drop() method removes the specified row.

In [130]:
df.drop('new2', axis=1)

# We deleted a column ("new2") from DataFrame. The axis parameter must be set to 1; otherwise, the column will not be found. 
# The change is not permanent. To make it permanent, the inplace parameter must be set to True.

Unnamed: 0,W,X,Y,Z,new1,newidx
A,1,6,7,8,42,CA
B,4,8,5,0,40,NY
C,5,8,1,3,8,WY
D,8,3,3,2,9,OR
E,8,3,7,0,21,CO


In [131]:
df

Unnamed: 0,W,X,Y,Z,new1,new2,newidx
A,1,6,7,8,42,0,CA
B,4,8,5,0,40,1,NY
C,5,8,1,3,8,2,WY
D,8,3,3,2,9,3,OR
E,8,3,7,0,21,4,CO


In [132]:

df.drop(["new1", "new2"], axis=1)

# To delete more than one column from DataFrame, column names must be written as a list.

Unnamed: 0,W,X,Y,Z,newidx
A,1,6,7,8,CA
B,4,8,5,0,NY
C,5,8,1,3,WY
D,8,3,3,2,OR
E,8,3,7,0,CO


In [133]:
df

Unnamed: 0,W,X,Y,Z,new1,new2,newidx
A,1,6,7,8,42,0,CA
B,4,8,5,0,40,1,NY
C,5,8,1,3,8,2,WY
D,8,3,3,2,9,3,OR
E,8,3,7,0,21,4,CO


In [134]:
df.drop(columns=["new1", "new2"])

# We do not need to specify an axis when we give the column names as keyword arg.

Unnamed: 0,W,X,Y,Z,newidx
A,1,6,7,8,CA
B,4,8,5,0,NY
C,5,8,1,3,WY
D,8,3,3,2,OR
E,8,3,7,0,CO


In [135]:
# It will NOT be permanent, unless inplace=True specified!

df.drop(["new1", "new2"], axis=1, inplace=True)
df

Unnamed: 0,W,X,Y,Z,newidx
A,1,6,7,8,CA
B,4,8,5,0,NY
C,5,8,1,3,WY
D,8,3,3,2,OR
E,8,3,7,0,CO


## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Removing Rows</p>

<a id="6"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

In [136]:
df.drop('C', axis=0)

# We deleted a row ("C") from DataFrame. 
# Even if the axis parameter is not set, the row(s) will be found and deleted, since the default is axis=0. 


Unnamed: 0,W,X,Y,Z,newidx
A,1,6,7,8,CA
B,4,8,5,0,NY
D,8,3,3,2,OR
E,8,3,7,0,CO


In [137]:
df.drop(index=['B'])

# No need to specify axis when index parameter is defined

Unnamed: 0,W,X,Y,Z,newidx
A,1,6,7,8,CA
C,5,8,1,3,WY
D,8,3,3,2,OR
E,8,3,7,0,CO


In [138]:
# the default value of axis is 0 (axis=0)

df_temp = df.drop('C', axis=0)
df_temp

Unnamed: 0,W,X,Y,Z,newidx
A,1,6,7,8,CA
B,4,8,5,0,NY
D,8,3,3,2,OR
E,8,3,7,0,CO


In [139]:
df

Unnamed: 0,W,X,Y,Z,newidx
A,1,6,7,8,CA
B,4,8,5,0,NY
C,5,8,1,3,WY
D,8,3,3,2,OR
E,8,3,7,0,CO


In [14]:
import numpy as np
import pandas as pd

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Selecting Rows and Columns using .loc[ ] and iloc[ ]</p>

## 1) **``Using Pandas.DataFrame.loc[]``** (By label)


**1.1 – Slicing Columns by Names or Labels**

By using **``pandas.DataFrame.loc[ ]``** you can slice columns by names or labels. To slice the columns, the syntax is **``df.loc[:, start:stop:step]``**, where start is the name of the first column to take, stop is the name of the last column to take (inclusive), and step is the number of positions to advance after each extraction.

**1.2 – Slicing DataFrame Columns by Labels**

To slice DataFrame columns by labels or names, all you need to do is provide the labels you want as a list. Here we use the list of labels instead of the start:stop:step approach.

**1.3 – Slicing DataFrame Columns by Range**

When you want to slice a DataFrame by a range of columns, provide the start and stop column names.

  - By not providing a start column, loc[] selects from the beginning.
  - By not providing stop, loc[] selects all columns from the start label.
  - Providing both start and stop, selects all columns in between.

**1.4 – Slicing Certain Selective Columns in pandas**

Sometimes you may want to select a few specific, non-adjacent columns from a pandas DataFrame. You can do this by passing the selected column names/labels as a list.
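The label-based column selections described in 1.1 to 1.4 can be sketched as follows. The `table` frame and its column names `a` to `e` are assumptions made for this example.

```python
import pandas as pd

table = pd.DataFrame(
    [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]],
    columns=["a", "b", "c", "d", "e"],
)

by_slice = table.loc[:, "b":"d"]     # start:stop, both ends included
by_step = table.loc[:, "a":"e":2]    # every second column
by_list = table.loc[:, ["e", "a"]]   # explicit labels, in the order given
from_start = table.loc[:, :"c"]      # no start: from the first column
to_end = table.loc[:, "c":]          # no stop: through the last column

print(by_slice.columns.tolist())
print(by_step.columns.tolist())
print(by_list.columns.tolist())
```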

## 2) **``Using Pandas.DataFrame.iloc[]``** (By position)

By using **``pandas.DataFrame.iloc[ ]``** you can slice a DataFrame by column **position/index**. Always remember that positions start from 0. You can use **``pandas.DataFrame.iloc[ ]``** with the syntax **``[:, start:stop:step]``**, where **start** is the position of the first column to take, **stop** is the position at which to stop (exclusive, so the last column taken is at `stop - 1`), and **step** is the number of positions to advance after each extraction. Or, use the syntax **``[:, [indices]]``** with indices as a list of column positions to take.

**2.1 – Slicing Columns by Index Position**

We are going to use columns by their index positions, and retrieve slices of DataFrame. Below example retrieves "country isocode", "POP" and "XRAT" slices of columns at the DataFrame.

**2.2 Column Slices by Position Range**

Like slices by column labels, you can also slice a DataFrame by a range of positions.
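A positional counterpart, as a minimal sketch: the same kind of made-up `table` frame, this time sliced with `iloc`. Note that the stop position is excluded here, unlike with `loc`.

```python
import pandas as pd

table = pd.DataFrame(
    [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]],
    columns=["a", "b", "c", "d", "e"],
)

by_range = table.iloc[:, 1:4]       # positions 1, 2, 3; position 4 is excluded
by_step = table.iloc[:, ::2]        # every second column, starting at position 0
by_list = table.iloc[:, [0, 4]]     # explicit positions
last_two = table.iloc[:, -2:]       # negative positions count from the end

print(by_range.columns.tolist())
print(by_step.columns.tolist())
print(by_list.columns.tolist())
print(last_two.columns.tolist())
```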



<a id="7"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

#### `.loc[]` → allows us to select data using **labels** (names) of rows (index) & columns

#### `.iloc[]` → allows us to select data using **index numbers** of rows (index) & columns. it's like classical indexing logic

![image.png](attachment:a8b3aa01-538f-4cf1-b129-d055c50c7349.png)


In [15]:
data = np.random.randint(1, 40, size=(8, 4))

df = pd.DataFrame(data, columns = ["var1", "var2", "var3", 'var4'])
df

Unnamed: 0,var1,var2,var3,var4
0,35,2,21,24
1,32,4,1,17
2,1,31,19,27
3,4,34,25,12
4,31,2,23,32
5,27,20,18,8
6,24,6,34,10
7,20,35,39,38


In [16]:
df.loc[4]

var1    31
var2     2
var3    23
var4    32
Name: 4, dtype: int64

In [17]:
# if we want the result as a DataFrame
df.loc[[4]]
# Returns the observation with index label 4 --> loc is used as loc[row, col]

Unnamed: 0,var1,var2,var3,var4
4,31,2,23,32


In [6]:
# slicing
df.loc[2 : 5]
# This slice returns the observations from label 2 (inclusive) to label 5 (inclusive). 
# The last label, 5, is INCLUSIVE in loc[]

Unnamed: 0,var1,var2,var3,var4
2,30,17,37,16
3,21,15,17,6
4,23,29,36,9
5,12,39,6,28


In [7]:
# slicing: iloc does not return row 5, because it uses classic positional indexing. Since the labels here are integers, the two are easy to confuse
df.iloc[2 : 5]

# This slice returns the observations from position 2 (inclusive) to position 5 (exclusive). 
# The last position, 5, is EXCLUSIVE in iloc[]

Unnamed: 0,var1,var2,var3,var4
2,30,17,37,16
3,21,15,17,6
4,23,29,36,9


In [19]:
df.index = "a b c d e f g h".split()
df
# We changed index labels. The length of the assigned values should be the same as the df's length.

Unnamed: 0,var1,var2,var3,var4
a,35,2,21,24
b,32,4,1,17
c,1,31,19,27
d,4,34,25,12
e,31,2,23,32
f,27,20,18,8
g,24,6,34,10
h,20,35,39,38


In [10]:
df.iloc[2:5]

Unnamed: 0,var1,var2,var3,var4
c,30,17,37,16
d,21,15,17,6
e,23,29,36,9


In [11]:
df.loc["c":"e"]
# If the index values are str (labeled), you can reach the rows you want by using loc[] and slicing in this way.

Unnamed: 0,var1,var2,var3,var4
c,30,17,37,16
d,21,15,17,6
e,23,29,36,9


**Use iloc for integer positions; use loc for labels**

In [12]:
# let's include the columns as well, separated by a comma
df.loc["d", "var3"]
# Returns the value from the intersection of the column of "var3" and row "d"

17

In [21]:
df.loc["d":"g", "var2"]

d    34
e     2
f    20
g     6
Name: var2, dtype: int64

In [None]:
df.loc["d":"g"]["var2"]

d    15
e    29
f    39
g    24
Name: var2, dtype: int64

In [15]:
# return the column as a DataFrame
df.loc["d":"g"][["var2"]]

Unnamed: 0,var2
d,15
e,29
f,39
g,24


In [16]:
df.loc["d":"g", ["var2"]]

Unnamed: 0,var2
d,15
e,29
f,39
g,24


In [17]:
df.iloc[2:5, 2]

c    37
d    17
e    36
Name: var3, dtype: int64

In [19]:
df.iloc[2:5, [2]]  # put only the column selector in [], not the whole indexer; the list on the column side is what makes the result a DataFrame instead of a Series

Unnamed: 0,var3
c,37
d,17
e,36


In [23]:
# df.iloc[2:5][2] raises an error
# df.iloc[2:5][[2]] raises an error as well
# The iloc call ends at df.iloc[2:5]; the 2 that follows is plain df[2] / df[[2]] column selection by label,
# and no column is labeled 2. Only a column label works here.
df.iloc[2:5]["var3"]

c    37
d    17
e    36
Name: var3, dtype: int64

In [24]:
df.iloc[2:5][["var3"]]

Unnamed: 0,var3
c,37
d,17
e,36
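
The chained form `df.iloc[2:5]["var3"]` mixes a positional row slice with a column label in two steps. One way to express the same selection in a single `iloc` call is to look up the column's position first. The sketch below assumes a hypothetical frame `demo` and uses `Index.get_loc` for the lookup:

```python
import pandas as pd

# Hypothetical frame with string row labels, similar in shape to the df above.
demo = pd.DataFrame(
    {"var1": [1, 2, 3, 4, 5, 6], "var3": [10, 20, 30, 40, 50, 60]},
    index=list("abcdef"),
)

# Two steps: positional row slice, then column selection by label.
chained = demo.iloc[2:5]["var3"]

# One step: translate the column label into its integer position for iloc.
col_pos = demo.columns.get_loc("var3")
single = demo.iloc[2:5, col_pos]

print(single)
```

Both forms return the same Series here; the single-step form keeps the whole selection inside one indexer.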


In [26]:
df.loc["a", ["var1"]]  # returns a Series

var1    4
Name: a, dtype: int64

In [27]:
df.loc[["a"], ["var1"]]  # even a single cell can be returned as a DataFrame

Unnamed: 0,var1
a,4


**fancy indexing**

In [28]:
df.loc[["a", "c"], ["var1", "var3"]] 

Unnamed: 0,var1,var3
a,4,11
c,30,37


In [204]:
mydict = [{'a': 1, 'b': 2, 'c': 3, 'd': 4},
          {'a': 100, 'b': 200, 'c': 300, 'd': 400},
          {'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000}]
df = pd.DataFrame(mydict)
df

Unnamed: 0,a,b,c,d
0,1,2,3,4
1,100,200,300,400
2,1000,2000,3000,4000


In [205]:
# Indexing just the rows: With a scalar integer.

print(type(df.iloc[0]))

df.iloc[0]  # a single [] returns a pd.Series

<class 'pandas.core.series.Series'>


a    1
b    2
c    3
d    4
Name: 0, dtype: int64

In [206]:
# With a list of integers.

print(df.iloc[[0]])  # a list inside [] returns a pd.DataFrame
type(df.iloc[[0]])

   a  b  c  d
0  1  2  3  4


pandas.core.frame.DataFrame

In [207]:
df.iloc[[0, 1]]  # returns the rows at positions 0 and 1; df.iloc[:2] gives the same result
# When selecting several rows, the integer positions go in a single list; they are not each wrapped in their own [].

Unnamed: 0,a,b,c,d
0,1,2,3,4
1,100,200,300,400


In [208]:
df.iloc[:2]

Unnamed: 0,a,b,c,d
0,1,2,3,4
1,100,200,300,400


In [209]:
df.iloc[[0, 2], [1, 3]]  # returns columns 1 and 3 of rows 0 and 2

Unnamed: 0,b,d
0,2,4
2,2000,4000


In [210]:
# select by slicing
df.iloc[1:3, 0:3]  # columns 0, 1 and 2 of rows 1 and 2. ROWS FIRST, THEN COLUMNS.
# For example, the last 4 columns of the first 2 rows: df.iloc[:2, -4:]

Unnamed: 0,a,b,c
1,100,200,300
2,1000,2000,3000


In [211]:
# return only the even-numbered rows
df.iloc[lambda x: x.index % 2 == 0]


# side note: df.iloc[:5] is equivalent to df.head()
# df.iloc[-5:] is equivalent to df.tail()

Unnamed: 0,a,b,c,d
0,1,2,3,4
2,1000,2000,3000,4000
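
The `head()`/`tail()` equivalence and the callable form of `iloc` can be checked on a slightly larger frame. This is a minimal sketch; the frame `demo` is a hypothetical stand-in with the default integer index:

```python
import numpy as np
import pandas as pd

# Hypothetical 8-row frame with the default RangeIndex 0..7.
demo = pd.DataFrame(np.arange(24).reshape(8, 3), columns=["a", "b", "c"])

first_five = demo.iloc[:5]   # same rows as demo.head()
last_five = demo.iloc[-5:]   # same rows as demo.tail()

# The callable receives the frame and returns a boolean array that iloc uses as a mask.
even_rows = demo.iloc[lambda frame: frame.index % 2 == 0]

print(even_rows)
```

The lambda relies on the index holding integers; with string labels, `frame.index % 2` would not be meaningful.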


## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Conditional Selection</p>

<a id="8"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

An important feature of pandas is conditional selection using bracket notation, very similar to numpy:

In [29]:
df

Unnamed: 0,var1,var2,var3,var4
a,4,31,11,4
b,19,37,30,34
c,30,17,37,16
d,21,15,17,6
e,23,29,36,9
f,12,39,6,28
g,33,24,4,38
h,9,32,13,33


### <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:LEFT; border-radius:10px 10px;">One Conditional Statement</p>

<a id="8.1"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

In [30]:
df>10

Unnamed: 0,var1,var2,var3,var4
a,False,True,True,False
b,True,True,True,True
c,True,True,True,True
d,True,True,True,False
e,True,True,True,False
f,True,True,False,True
g,True,True,False,True
h,False,True,True,True


In [31]:
df[df>10]  # cells where the condition is False become NaN; with a Series, only the True entries were returned and the False ones were dropped.

Unnamed: 0,var1,var2,var3,var4
a,,31,11.0,
b,19.0,37,30.0,34.0
c,30.0,17,37.0,16.0
d,21.0,15,17.0,
e,23.0,29,36.0,
f,12.0,39,,28.0
g,33.0,24,,38.0
h,,32,13.0,33.0
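
The NaN cells also explain why some columns in the output above show decimals: a column that receives a NaN is converted to float. A small sketch of this behavior, assuming a hypothetical two-row frame `demo`:

```python
import pandas as pd

# Hypothetical frame: only one cell (row "a", var1) fails the > 10 test.
demo = pd.DataFrame({"var1": [4, 19], "var2": [31, 37]}, index=["a", "b"])

masked = demo[demo > 10]             # keeps the shape, fills failing cells with NaN
same_result = demo.where(demo > 10)  # the explicit method form of the same operation

print(masked)
print(masked.dtypes)
```

Indexing with a boolean DataFrame keeps the original shape, which is different from filtering rows with a boolean Series.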


In [34]:
# look at a single column
df[df["var1"] > 10]  # returns the rows where var1 is greater than 10, because the boolean Series df["var1"] > 10 filters rows.
# To return a single column of those rows, use loc and name the column after the comma.
# The mask can also be stored in a variable first:
# filt = df["var1"] > 10    df.loc[filt, "var1"]

Unnamed: 0,var1,var2,var3,var4
b,19,37,30,34
c,30,17,37,16
d,21,15,17,6
e,23,29,36,9
f,12,39,6,28
g,33,24,4,38


In [38]:
# how do we return a single column?
# the var2 column for the rows where var1 is greater than 10
df[df["var1"] > 10][["var2"]]

Unnamed: 0,var2
b,37
c,17
d,15
e,29
f,39
g,24


In [39]:
df[df["var1"] > 10][["var2", "var3"]]  # show two columns

Unnamed: 0,var2,var3
b,37,30
c,17,37
d,15,17
e,29,36
f,39,6
g,24,4


### <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:LEFT; border-radius:10px 10px;">Two or More Conditional Statements</p>

<a id="8.2"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

**For two or more conditions, use | → or and & → and, with each condition wrapped in parentheses:**

In [40]:
(df["var1"] > 10) & (df["var1"] < 20)

a    False
b     True
c    False
d    False
e    False
f     True
g    False
h    False
Name: var1, dtype: bool

In [41]:
df[(df["var1"] > 10) & (df["var1"] < 20)]

Unnamed: 0,var1,var2,var3,var4
b,19,37,30,34
f,12,39,6,28


In [42]:
# for a single column
df[(df["var1"] > 10) & (df["var1"] < 20)][["var4"]]
# returns the var4 column for the rows that satisfy these conditions
# e.g. return the brand column (var4) of the cars with 10k-20k km of mileage

Unnamed: 0,var4
b,34
f,28


In [73]:
data5 = pd.DataFrame({
    'age' :     [ 10, 22, 13, 21, 12, 11, 17],
    'section' : [ 'A', 'B', 'C', 'B', 'B', 'A', 'A'],
    'city' :    [ 'Gurgaon', 'Delhi', 'Mumbai', 'Delhi', 'Mumbai', 'Delhi', 'Mumbai'],
    'gender' :  [ 'M', 'F', 'F', 'M', 'M', 'M', 'F'],
    'favourite_color' : [ 'red', 'blue', 'yellow', 'pink', 'black', 'green', 'red']
})

data5

Unnamed: 0,age,section,city,gender,favourite_color
0,10,A,Gurgaon,M,red
1,22,B,Delhi,F,blue
2,13,C,Mumbai,F,yellow
3,21,B,Delhi,M,pink
4,12,B,Mumbai,M,black
5,11,A,Delhi,M,green
6,17,A,Mumbai,F,red


In [74]:
data5[data5["city"].isin(["Gurgaon","Delhi"])]  # isin() builds a True/False mask; used as a filter, it returns
# every row whose value is in the given list.

Unnamed: 0,age,section,city,gender,favourite_color
0,10,A,Gurgaon,M,red
1,22,B,Delhi,F,blue
3,21,B,Delhi,M,pink
5,11,A,Delhi,M,green
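
An `isin()` mask is an ordinary boolean Series, so it can be combined with other conditions using `&` and `|`, or inverted with `~`. A minimal sketch, assuming a hypothetical frame `people` with a subset of the columns above:

```python
import pandas as pd

# Hypothetical frame reusing a few values from data5.
people = pd.DataFrame({
    "age": [10, 22, 13, 21],
    "city": ["Gurgaon", "Delhi", "Mumbai", "Delhi"],
})

in_cities = people["city"].isin(["Gurgaon", "Delhi"])

# Combine the mask with a second condition; each condition sits in its own parentheses.
adults_in_cities = people[in_cities & (people["age"] >= 21)]

# ~ inverts the mask: rows whose city is NOT in the list.
outside = people[~in_cities]

print(adults_in_cities)
print(outside)
```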


### <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:LEFT; border-radius:10px 10px;">Conditional Selection Using .loc[ ] and .iloc[ ]</p>

<a id="8.3"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

In [43]:
df.loc[["a"]]

Unnamed: 0,var1,var2,var3,var4
a,4,31,11,4


In [45]:
df.loc["a":"f"]

Unnamed: 0,var1,var2,var3,var4
a,4,31,11,4
b,19,37,30,34
c,30,17,37,16
d,21,15,17,6
e,23,29,36,9
f,12,39,6,28


In [46]:
df.loc[["a","f"]]

Unnamed: 0,var1,var2,var3,var4
a,4,31,11,4
f,12,39,6,28


In [47]:
df.loc[["a","f"], ["var1", "var4"]]

Unnamed: 0,var1,var4
a,4,4
f,12,28


In [48]:
df.loc[(df["var1"]>10), ["var2", "var3"] ]  # a condition on the rows, fancy indexing on the columns

Unnamed: 0,var2,var3
b,37,30
c,17,37
d,15,17
e,29,36
f,39,6
g,24,4


In [54]:
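# use more than one condition
# Note: every value is either > 10 or < 30, so with | (or) all rows are returned; use & (and) to select a range.
df.loc[(df["var1"]>10) | (df["var1"]<30), ["var1", "var3"]]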

Unnamed: 0,var1,var3
a,4,11
b,19,30
c,30,37
d,21,17
e,23,36
f,12,6
g,33,4
h,9,13
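
The result above contains every row because any number is either greater than 10 or less than 30. The contrast between `|` and `&` on the same pair of conditions can be sketched as follows, assuming a hypothetical single-column frame `demo`:

```python
import pandas as pd

# Hypothetical frame with a few of the var1 values used above.
demo = pd.DataFrame({"var1": [4, 19, 30, 21, 33]}, index=list("abcde"))

# OR: a row passes if it satisfies at least one condition, which is always the case here.
either = demo.loc[(demo["var1"] > 10) | (demo["var1"] < 30)]

# AND: a row passes only if it satisfies both, i.e. 10 < var1 < 30.
both = demo.loc[(demo["var1"] > 10) & (demo["var1"] < 30)]

print(len(either), len(both))
```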


## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">reset_index() & set_index()</p>

<a id="9"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

Let's discuss some more features of indexing, including resetting the index or setting it to something else. We'll also talk about index hierarchy!

In [50]:
df

Unnamed: 0,var1,var2,var3,var4
a,4,31,11,4
b,19,37,30,34
c,30,17,37,16
d,21,15,17,6
e,23,29,36,9
f,12,39,6,28
g,33,24,4,38
h,9,32,13,33


In [52]:
df.reset_index()  # the index is back to the default integers
# reset_index() resets the index of the DataFrame and uses the default integer index instead; the old index becomes a column.
# If the DataFrame has a MultiIndex, this method can remove one or more levels.

Unnamed: 0,index,var1,var2,var3,var4
0,a,4,31,11,4
1,b,19,37,30,34
2,c,30,17,37,16
3,d,21,15,17,6
4,e,23,29,36,9
5,f,12,39,6,28
6,g,33,24,4,38
7,h,9,32,13,33


In [55]:
# but the old index is still there as a column; let's drop it
df.reset_index(drop=True)
# If we do not want to keep the old index as a new column, we can set the drop parameter to True.
# Two ways to make the change permanent: pass inplace=True, or assign the result back (df = ...)

Unnamed: 0,var1,var2,var3,var4
0,4,31,11,4
1,19,37,30,34
2,30,17,37,16
3,21,15,17,6
4,23,29,36,9
5,12,39,6,28
6,33,24,4,38
7,9,32,13,33


In [57]:
# set_index moves the column we choose into the index
df.set_index("var4")
# set_index() sets the DataFrame index (row labels) using one or more existing columns or arrays (of the correct length).
# The new index can replace the existing index or expand on it.

Unnamed: 0_level_0,var1,var2,var3
var4,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
4,4,31,11
34,19,37,30
16,30,17,37
6,21,15,17
9,23,29,36
28,12,39,6
38,33,24,4
33,9,32,13
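
Neither `set_index()` nor `reset_index()` changes the original frame unless the result is assigned back (or `inplace=True` is passed). A minimal sketch of a round trip, assuming a hypothetical frame `demo`:

```python
import pandas as pd

# Hypothetical frame with string row labels.
demo = pd.DataFrame({"var1": [4, 19, 30], "var4": [40, 34, 16]}, index=["a", "b", "c"])

indexed = demo.set_index("var4")  # var4 becomes the index; demo itself is untouched
restored = indexed.reset_index()  # var4 returns as a column, with a fresh 0..n-1 index

print(demo.index.tolist())
print(indexed.index.name)
print(restored)
```

Note that the round trip does not bring back the original labels `a`, `b`, `c`: `set_index()` discarded them when it replaced the index.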


## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Multi-Index & Index Hierarchy</p>

<a id="10"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

Let us go over how to work with a MultiIndex. First, we'll create a quick example of what a Multi-Indexed DataFrame looks like:

In [265]:
# Index Levels

outside = ['M1', 'M1', 'M1', 'M2', 'M2', 'M2','M3', 'M3', 'M3']
inside = [1, 2, 3, 1, 2, 3, 5, 6, 7]
multi_index = list(zip(outside, inside))
multi_index

[('M1', 1),
 ('M1', 2),
 ('M1', 3),
 ('M2', 1),
 ('M2', 2),
 ('M2', 3),
 ('M3', 5),
 ('M3', 6),
 ('M3', 7)]

In [266]:
pd.MultiIndex.from_tuples(multi_index)

MultiIndex([('M1', 1),
            ('M1', 2),
            ('M1', 3),
            ('M2', 1),
            ('M2', 2),
            ('M2', 3),
            ('M3', 5),
            ('M3', 6),
            ('M3', 7)],
           )

In [267]:
hier_index = pd.MultiIndex.from_tuples(multi_index)

In [268]:
np.random.seed(101)
df = pd.DataFrame(np.random.randn(9, 4), index = hier_index, columns=['A', 'B', 'C', 'D'])
df

Unnamed: 0,Unnamed: 1,A,B,C,D
M1,1,2.70685,0.628133,0.907969,0.503826
M1,2,0.651118,-0.319318,-0.848077,0.605965
M1,3,-2.018168,0.740122,0.528813,-0.589001
M2,1,0.188695,-0.758872,-0.933237,0.955057
M2,2,0.190794,1.978757,2.605967,0.683509
M2,3,0.302665,1.693723,-1.706086,-1.159119
M3,5,-0.134841,0.390528,0.166905,0.184502
M3,6,0.807706,0.07296,0.638787,0.329646
M3,7,-0.497104,-0.75407,-0.943406,0.484752


**``Note``** that all of the MultiIndex constructors accept a names argument which stores string names for the levels themselves. If no names are provided, None will be assigned:

For more information on indexing and selecting data, visit [**Pandas Official Documentation**](https://pandas.pydata.org/pandas-docs/version/0.13.0/indexing.html)

In [72]:
df.index.names  # the None values in the output mean the levels have no names
# We wanted to see the names of the index levels, but it returned None because no names have been assigned yet.

FrozenList([None, None])

In [146]:
df.index.names = ['Group', 'Num']
df.index.names

# We have assigned index names: "Group" for the outer level and "Num" for the
# inner level.

FrozenList(['Group', 'Num'])
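
Instead of assigning `index.names` afterwards, the level names can also be supplied when the index is built, through the `names` argument mentioned in the note above. A short sketch with a hypothetical list of tuples `pairs`:

```python
import pandas as pd

# Hypothetical (outer, inner) label pairs.
pairs = [("M1", 1), ("M1", 2), ("M2", 1)]

unnamed = pd.MultiIndex.from_tuples(pairs)                        # levels have no names
named = pd.MultiIndex.from_tuples(pairs, names=["Group", "Num"])  # names set at construction

print(list(unnamed.names))
print(list(named.names))
```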

In [147]:
# After this assignment, let's check the names of each Multiindex level in the DataFrame
df

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C,D
Group,Num,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
M1,1,2.70685,0.628133,0.907969,0.503826
M1,2,0.651118,-0.319318,-0.848077,0.605965
M1,3,-2.018168,0.740122,0.528813,-0.589001
M2,1,0.188695,-0.758872,-0.933237,0.955057
M2,2,0.190794,1.978757,2.605967,0.683509
M2,3,0.302665,1.693723,-1.706086,-1.159119
M3,5,-0.134841,0.390528,0.166905,0.184502
M3,6,0.807706,0.07296,0.638787,0.329646
M3,7,-0.497104,-0.75407,-0.943406,0.484752


In [148]:
# Since the index has two levels, the index attribute returns the index values as pairs (tuples).

df.index

MultiIndex([('M1', 1),
            ('M1', 2),
            ('M1', 3),
            ('M2', 1),
            ('M2', 2),
            ('M2', 3),
            ('M3', 5),
            ('M3', 6),
            ('M3', 7)],
           names=['Group', 'Num'])

In [149]:
df.index.levels

# We saw the unique list of index values in each level

FrozenList([['M1', 'M2', 'M3'], [1, 2, 3, 5, 6, 7]])

In [150]:
df.index.get_level_values(0)

# Returns all the values at index level 0

Index(['M1', 'M1', 'M1', 'M2', 'M2', 'M2', 'M3', 'M3', 'M3'], dtype='object', name='Group')

In [151]:
df.index.get_level_values("Group")

# We have seen all the values in the index named "Group"

Index(['M1', 'M1', 'M1', 'M2', 'M2', 'M2', 'M3', 'M3', 'M3'], dtype='object', name='Group')

Now let's show how to index this! For an index hierarchy on the rows we use ``df.loc[]``; if the hierarchy were on the columns axis, you would just use normal bracket notation ``df[]``. Calling one level of the index returns the sub-dataframe:

In [152]:
df[["A"]]

Unnamed: 0_level_0,Unnamed: 1_level_0,A
Group,Num,Unnamed: 2_level_1
M1,1,2.70685
M1,2,0.651118
M1,3,-2.018168
M2,1,0.188695
M2,2,0.190794
M2,3,0.302665
M3,5,-0.134841
M3,6,0.807706
M3,7,-0.497104


In [153]:
df[["A", "B"]]

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B
Group,Num,Unnamed: 2_level_1,Unnamed: 3_level_1
M1,1,2.70685,0.628133
M1,2,0.651118,-0.319318
M1,3,-2.018168,0.740122
M2,1,0.188695,-0.758872
M2,2,0.190794,1.978757
M2,3,0.302665,1.693723
M3,5,-0.134841,0.390528
M3,6,0.807706,0.07296
M3,7,-0.497104,-0.75407


In [154]:
df.loc['M1']

Unnamed: 0_level_0,A,B,C,D
Num,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,2.70685,0.628133,0.907969,0.503826
2,0.651118,-0.319318,-0.848077,0.605965
3,-2.018168,0.740122,0.528813,-0.589001


In [155]:
df.loc[("M1", 2)]

A    0.651118
B   -0.319318
C   -0.848077
D    0.605965
Name: (M1, 2), dtype: float64

In [156]:
df.loc[[("M1", 2)]]

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C,D
Group,Num,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
M1,2,0.651118,-0.319318,-0.848077,0.605965


In [157]:
df.loc["M1", "A":"C"]

Unnamed: 0_level_0,A,B,C
Num,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,2.70685,0.628133,0.907969
2,0.651118,-0.319318,-0.848077
3,-2.018168,0.740122,0.528813


In [158]:
df.loc[[("M1", 2)], "A":"C"]

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C
Group,Num,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
M1,2,0.651118,-0.319318,-0.848077


In [159]:
df.loc["M1":"M2"]

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C,D
Group,Num,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
M1,1,2.70685,0.628133,0.907969,0.503826
M1,2,0.651118,-0.319318,-0.848077,0.605965
M1,3,-2.018168,0.740122,0.528813,-0.589001
M2,1,0.188695,-0.758872,-0.933237,0.955057
M2,2,0.190794,1.978757,2.605967,0.683509
M2,3,0.302665,1.693723,-1.706086,-1.159119


In [160]:
df.loc[("M1", 2):("M2", 3)]

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C,D
Group,Num,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
M1,2,0.651118,-0.319318,-0.848077,0.605965
M1,3,-2.018168,0.740122,0.528813,-0.589001
M2,1,0.188695,-0.758872,-0.933237,0.955057
M2,2,0.190794,1.978757,2.605967,0.683509
M2,3,0.302665,1.693723,-1.706086,-1.159119


In [161]:
df.loc[("M1", 2): "M2"]

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C,D
Group,Num,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
M1,2,0.651118,-0.319318,-0.848077,0.605965
M1,3,-2.018168,0.740122,0.528813,-0.589001
M2,1,0.188695,-0.758872,-0.933237,0.955057
M2,2,0.190794,1.978757,2.605967,0.683509
M2,3,0.302665,1.693723,-1.706086,-1.159119


In [162]:
df.loc[[("M2", 3), ("M3", 5)]]

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C,D
Group,Num,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
M2,3,0.302665,1.693723,-1.706086,-1.159119
M3,5,-0.134841,0.390528,0.166905,0.184502


In [281]:
# column B for every M group, but only the rows where the second level equals 2
df.loc[(slice(None), 2), ["B"]]

Unnamed: 0,Unnamed: 1,B
M1,2,-0.319318
M2,2,1.978757


# Pandas loc vs. iloc vs. at vs. iat?

loc : label-based; selects by index and column labels (it also accepts boolean masks)

iloc : position-based; selects by integer position

at : gets a single scalar value by label. It's a very fast loc

iat : gets a single scalar value by position. It's a very fast iloc

at and iat are meant to access a scalar, that is, a single element in the dataframe, while loc and iloc are meant to access several elements at the same time, potentially to perform vectorized operations.

In [None]:
# iat returns a single value by position; at returns a single value by label

In [221]:
df88 = pd.DataFrame(np.arange(15).reshape(5,3), columns = ["col1", "col2", "col3"])
df88

Unnamed: 0,col1,col2,col3
0,0,1,2
1,3,4,5
2,6,7,8
3,9,10,11
4,12,13,14


In [224]:
df88.at[0,"col2"]

1

In [225]:
df88.iat[0,1]

1
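
Besides reading a cell, `at` and `iat` can also write one. A minimal sketch, rebuilding the same kind of frame under the hypothetical name `demo`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same layout as df88.
demo = pd.DataFrame(np.arange(15).reshape(5, 3), columns=["col1", "col2", "col3"])

by_label = demo.at[0, "col2"]  # same cell as demo.loc[0, "col2"]
by_position = demo.iat[0, 1]   # same cell as demo.iloc[0, 1]

# Assigning through at changes exactly one cell in place.
demo.at[4, "col3"] = 99

print(by_label, by_position, demo.iat[4, 2])
```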

For more information on MultiIndex and advanced indexing, visit [**Pandas Official Documentation**](https://pandas.pydata.org/docs/user_guide/advanced.html)

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Some Other Useful Methods with Iris Dataset</p>

<a id="11"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

### Let's apply the functions/attributes/methods we have learned to the "iris" dataset

In [238]:
import seaborn as sns

In [239]:
sns.get_dataset_names()

['anagrams',
 'anscombe',
 'attention',
 'brain_networks',
 'car_crashes',
 'diamonds',
 'dots',
 'dowjones',
 'exercise',
 'flights',
 'fmri',
 'geyser',
 'glue',
 'healthexp',
 'iris',
 'mpg',
 'penguins',
 'planets',
 'seaice',
 'taxis',
 'tips',
 'titanic']

In [240]:
df = sns.load_dataset('iris')
df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


In [166]:
df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [167]:
df.shape

(150, 5)

In [168]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [169]:
df.sample(4)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
30,4.8,3.1,1.6,0.2,setosa
66,5.6,3.0,4.5,1.5,versicolor
117,7.7,3.8,6.7,2.2,virginica
96,5.7,2.9,4.2,1.3,versicolor


In [170]:
df.describe()
# returns summary statistics for the numeric columns

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
count,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333
std,0.828066,0.435866,1.765298,0.762238
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


In [171]:
# df.describe().T or df.describe().transpose()

df.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
sepal_length,150.0,5.843333,0.828066,4.3,5.1,5.8,6.4,7.9
sepal_width,150.0,3.057333,0.435866,2.0,2.8,3.0,3.3,4.4
petal_length,150.0,3.758,1.765298,1.0,1.6,4.35,5.1,6.9
petal_width,150.0,1.199333,0.762238,0.1,0.3,1.3,1.8,2.5


In [172]:
df.describe(include="all")

# "number" and "object" can also be passed as the include/exclude parameter. With include="all", every column is shown, not just the numeric ones.
# For categorical columns, the unique, top and freq rows give a general impression: the number of distinct values, the most common value and its frequency.

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
count,150.0,150.0,150.0,150.0,150
unique,,,,,3
top,,,,,setosa
freq,,,,,50
mean,5.843333,3.057333,3.758,1.199333,
std,0.828066,0.435866,1.765298,0.762238,
min,4.3,2.0,1.0,0.1,
25%,5.1,2.8,1.6,0.3,
50%,5.8,3.0,4.35,1.3,
75%,6.4,3.3,5.1,1.8,


In [242]:
df.describe(include="object")
# to see only the categorical columns

Unnamed: 0,species
count,150
unique,3
top,setosa
freq,50


In [173]:
df.corr()
# In pandas 2.0 and later, pass numeric_only=True here, because the species column holds strings.

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
sepal_length,1.0,-0.11757,0.871754,0.817941
sepal_width,-0.11757,1.0,-0.42844,-0.366126
petal_length,0.871754,-0.42844,1.0,0.962865
petal_width,0.817941,-0.366126,0.962865,1.0


In [174]:
df.corr()[["sepal_length"]]

Unnamed: 0,sepal_length
sepal_length,1.0
sepal_width,-0.11757
petal_length,0.871754
petal_width,0.817941


In [175]:
df['petal_length'].corr(df["petal_width"])

0.9628654314027961
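
When a frame mixes numeric and text columns, one version-independent way to compute the correlation matrix is to select the numeric columns first. The sketch below assumes a hypothetical frame `demo` and uses `select_dtypes()` for that step:

```python
import pandas as pd

# Hypothetical frame: two numeric columns and one text column.
demo = pd.DataFrame({
    "length": [1.0, 2.0, 3.0, 4.0],
    "width": [2.0, 4.0, 6.0, 8.0],
    "kind": ["a", "b", "a", "b"],
})

numeric = demo.select_dtypes(include="number")  # drops the text column
corr_matrix = numeric.corr()                    # pairwise Pearson correlation

# Correlation between two specific columns, as a single number.
pair = demo["length"].corr(demo["width"])

print(corr_matrix)
print(pair)
```

Since `width` is exactly twice `length` in this toy data, the two columns are perfectly correlated.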

In [176]:

df.species.value_counts(dropna=False)
# value_counts is typically used on categorical data and shows how many times each unique value occurs

setosa        50
versicolor    50
virginica     50
Name: species, dtype: int64

In [177]:
df['species'].value_counts(dropna=False, normalize=True)

setosa        0.333333
versicolor    0.333333
virginica     0.333333
Name: species, dtype: float64

In [178]:
df.species.unique()
# shows the unique values themselves and their dtype

array(['setosa', 'versicolor', 'virginica'], dtype=object)

In [179]:
df.species.nunique()
# nunique returns the number of distinct values

3
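
Because the iris classes are perfectly balanced, the difference between these three methods is easier to see on uneven data. A small sketch with a hypothetical Series `species`:

```python
import pandas as pd

# Hypothetical, deliberately unbalanced sample of species labels.
species = pd.Series(["setosa", "virginica", "setosa", "versicolor", "setosa"])

counts = species.value_counts()                # occurrences per value, most frequent first
shares = species.value_counts(normalize=True)  # the same as fractions of the total
distinct = species.unique()                    # the distinct values, in order of appearance
n_distinct = species.nunique()                 # how many distinct values there are

print(counts)
print(distinct, n_distinct)
```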

In [243]:
df.loc[df["species"] == "setosa", ["species", "sepal_length"]]

# When filtering a DataFrame by a condition, prefer loc; chained alternatives such as df[mask][cols] can cause problems, especially when assigning values.

Unnamed: 0,species,sepal_length
0,setosa,5.1
1,setosa,4.9
2,setosa,4.7
3,setosa,4.6
4,setosa,5.0
5,setosa,5.4
6,setosa,4.6
7,setosa,5.0
8,setosa,4.4
9,setosa,4.9


In [244]:
df[(df.sepal_length > 4) & (df.sepal_length < 5)]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
13,4.3,3.0,1.1,0.1,setosa
8,4.4,2.9,1.4,0.2,setosa
42,4.4,3.2,1.3,0.2,setosa
38,4.4,3.0,1.3,0.2,setosa
41,4.5,2.3,1.3,0.3,setosa
3,4.6,3.1,1.5,0.2,setosa
6,4.6,3.4,1.4,0.3,setosa
47,4.6,3.2,1.4,0.2,setosa
22,4.6,3.6,1.0,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa


In [245]:
df[(df.sepal_length > 4) & (df.sepal_length < 5)].sort_values("sepal_length")

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
13,4.3,3.0,1.1,0.1,setosa
8,4.4,2.9,1.4,0.2,setosa
42,4.4,3.2,1.3,0.2,setosa
38,4.4,3.0,1.3,0.2,setosa
41,4.5,2.3,1.3,0.3,setosa
3,4.6,3.1,1.5,0.2,setosa
6,4.6,3.4,1.4,0.3,setosa
47,4.6,3.2,1.4,0.2,setosa
22,4.6,3.6,1.0,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa


In [246]:
df[(df.sepal_length > 4) & (df.sepal_length < 5)][["species"]]

Unnamed: 0,species
1,setosa
2,setosa
3,setosa
6,setosa
8,setosa
9,setosa
11,setosa
12,setosa
13,setosa
22,setosa


In [182]:
# The result above includes a virginica row; to check directly whether that is a mistake:
df[(df.species == "virginica") & (df.sepal_length > 4)  & (df.sepal_length < 5)]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
106,4.9,2.5,4.5,1.7,virginica


In [251]:

df.loc[((df.species == "virginica") & (df.sepal_length > 4)  & (df.sepal_length < 5)), ["sepal_length"]]
# loc filters the rows and selects the column in a single step; without loc this takes a second, chained selection.

Unnamed: 0,sepal_length
106,4.9


In [183]:
df.sort_values(by='sepal_length', ascending=True)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
13,4.3,3.0,1.1,0.1,setosa
42,4.4,3.2,1.3,0.2,setosa
38,4.4,3.0,1.3,0.2,setosa
8,4.4,2.9,1.4,0.2,setosa
41,4.5,2.3,1.3,0.3,setosa
...,...,...,...,...,...
122,7.7,2.8,6.7,2.0,virginica
118,7.7,2.6,6.9,2.3,virginica
117,7.7,3.8,6.7,2.2,virginica
135,7.7,3.0,6.1,2.3,virginica


# PRACTICE

In [212]:
import pandas as pd
people_dict={"first":["Richard", "Robert", "Jason"],
             "last" :["Stone", "Deepdive", "Seaborn"],
             "email": ["richardstone@email.com", "robertdeepdive@email.com", 
                        "jasonseaborn@email.com"]}
df=pd.DataFrame(people_dict)
df

Unnamed: 0,first,last,email
0,Richard,Stone,richardstone@email.com
1,Robert,Deepdive,robertdeepdive@email.com
2,Jason,Seaborn,jasonseaborn@email.com


In [213]:
df[(df['first']=="Richard") | (df['last'] == 'Deepdive')]

Unnamed: 0,first,last,email
0,Richard,Stone,richardstone@email.com
1,Robert,Deepdive,robertdeepdive@email.com


In [236]:
data = pd.DataFrame({
    'age' :     [ 10, 22, 13, 21, 12, 11, 17],
    'section' : [ 'A', 'B', 'C', 'B', 'B', 'A', 'A'],
    'city' :    [ 'Gurgaon', 'Delhi', 'Mumbai', 'Delhi', 'Mumbai', 'Delhi', 'Mumbai'],
    'gender' :  [ 'M', 'F', 'F', 'M', 'M', 'M', 'F'],
    'favourite_color' : [ 'red', 'blue', 'yellow', 'pink', 'black', 'green', 'red']
})

data

Unnamed: 0,age,section,city,gender,favourite_color
0,10,A,Gurgaon,M,red
1,22,B,Delhi,F,blue
2,13,C,Mumbai,F,yellow
3,21,B,Delhi,M,pink
4,12,B,Mumbai,M,black
5,11,A,Delhi,M,green
6,17,A,Mumbai,F,red


In [215]:
data[data["age"]>=21]

Unnamed: 0,age,section,city,gender,favourite_color
1,22,B,Delhi,F,blue
3,21,B,Delhi,M,pink


In [216]:
data[data["age"]>21]

Unnamed: 0,age,section,city,gender,favourite_color
1,22,B,Delhi,F,blue


In [237]:
data['new'] = data['city'].str[:2]  # without .str, [:2] would slice the first two rows (Gurgaon and Delhi) and the remaining rows would get NaN
print(data)
print(data.set_index("new"))

   age section     city gender favourite_color new
0   10       A  Gurgaon      M             red  Gu
1   22       B    Delhi      F            blue  De
2   13       C   Mumbai      F          yellow  Mu
3   21       B    Delhi      M            pink  De
4   12       B   Mumbai      M           black  Mu
5   11       A    Delhi      M           green  De
6   17       A   Mumbai      F             red  Mu
     age section     city gender favourite_color
new                                             
Gu    10       A  Gurgaon      M             red
De    22       B    Delhi      F            blue
Mu    13       C   Mumbai      F          yellow
De    21       B    Delhi      M            pink
Mu    12       B   Mumbai      M           black
De    11       A    Delhi      M           green
Mu    17       A   Mumbai      F             red
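
The role of `.str` in the cell above can be isolated in a short sketch. The names `city` and `frame` are hypothetical, and the row slice is written with `iloc` to make explicit that it selects rows:

```python
import pandas as pd

# Hypothetical Series of city names.
city = pd.Series(["Gurgaon", "Delhi", "Mumbai"])

prefixes = city.str[:2]     # slices every string: first two characters of each value
first_rows = city.iloc[:2]  # slices the Series: first two rows, full strings

# Assigning a shorter Series aligns on the index, so the missing row becomes NaN.
frame = pd.DataFrame({"city": city})
frame["new"] = first_rows

print(prefixes.tolist())
print(frame)
```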


In [218]:
data = pd.DataFrame({
    'age' :     [ 10, 22, 13, 21, 12, 11, 17],
    'section' : [ 'A', 'B', 'C', 'B', 'B', 'A', 'A'],
    'city' :    [ 'Gurgaon', 'Delhi', 'Mumbai', 'Delhi', 'Mumbai', 'Delhi', 'Mumbai'],
    'gender' :  [ 'M', 'F', 'F', 'M', 'M', 'M', 'F'],
    'favourite_color' : [ 'red', 'blue', 'yellow', 'pink', 'black', 'green', 'red']
})

data.set_index('section', inplace=True)
# What inplace means here: by default set_index does not modify the original DataFrame, so displaying data would still show the old index.
# There are two ways to make the change permanent: 1. assign the result back (data = data.set_index('section'))
# 2. or switch the inplace keyword argument from its default of False to True.
data

Unnamed: 0_level_0,age,city,gender,favourite_color
section,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
A,10,Gurgaon,M,red
B,22,Delhi,F,blue
C,13,Mumbai,F,yellow
B,21,Delhi,M,pink
B,12,Mumbai,M,black
A,11,Delhi,M,green
A,17,Mumbai,F,red


In [219]:
data.drop('city', axis = 1, inplace=True)
data

Unnamed: 0_level_0,age,gender,favourite_color
section,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
A,10,M,red
B,22,F,blue
C,13,F,yellow
B,21,M,pink
B,12,M,black
A,11,M,green
A,17,F,red
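
The same column can be dropped without `inplace=True` by keeping the returned frame. A minimal sketch, assuming a hypothetical frame `demo` and using the `columns=` keyword as an alternative to `axis=1`:

```python
import pandas as pd

# Hypothetical frame with two columns.
demo = pd.DataFrame({"age": [10, 22], "city": ["Gurgaon", "Delhi"]})

# drop() returns a new frame; demo keeps its city column.
trimmed = demo.drop(columns="city")

print(demo.columns.tolist())
print(trimmed.columns.tolist())
```

Keeping the result under a new name leaves the original available for later cells, which can make a notebook easier to re-run out of order.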
