
# <font color='green'> <b>Importing Libraries </b><font color='black'>

In [1]:
import numpy as np
import pandas as pd

## <font color='blue'> <b>Basic Attributes & Methods of Series</b><font color='black'>

**SOME COMMON ATTRIBUTES & METHODS** [Official Pandas API Document](https://pandas.pydata.org/docs/reference/api/pandas.Series.html)<br>

**Series.values:** Returns the values of the series as a NumPy array.

**Series.index:** Returns the indices of the series.

**Series.dtype:** Returns the data type of the series.

**Series.size:** Returns the number of elements in the series.

**Series.shape:** Returns the dimensions of the series.

**Series.ndim:** Returns the number of dimensions of the series.

**Series.head():** Returns the first n elements of the series (5 by default).

**Series.tail():** Returns the last n elements of the series (5 by default).

**Series.sample():** Returns n randomly selected elements from the series (1 by default).

**Series.describe():** Returns the statistical summary of the series.

**Series.sort_index():** Sorts the series by its indices.

**Series.sort_values():** Sorts the series by its values.

**Series.isnull():** Checks whether each element in the series is missing (NaN or None).

**Series.fillna():** Fills null values with a specified value.

**Series.dropna():** Removes null values from the series.

**Series.isin():** Checks whether the elements in the series are present in the given values.
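
The missing-value methods and `isin()` from the list above are not run in the cells that follow, so here is a minimal sketch of them. The Series `s` and its values are made up for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical example data with two missing entries (np.nan and None)
s = pd.Series([3.0, np.nan, 7.0, None, 5.0])

null_mask = s.isnull()        # True where the value is missing
filled = s.fillna(0)          # replace missing values with 0
dropped = s.dropna()          # keep only the non-missing values
in_set = s.isin([3.0, 5.0])   # True where the value is 3.0 or 5.0
summary = s.describe()        # count, mean, std, min, quartiles, max

print(null_mask.tolist())
print(filled.tolist())
print(dropped.tolist())
print(in_set.tolist())
print(summary["count"], summary["mean"])
```

Note that `describe()` ignores the missing values, so `count` is 3 here rather than 5.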

In [2]:
np.random.seed(42)

seri = pd.Series(np.random.randint(0, 50, 8))
seri

0    38
1    28
2    14
3    42
4     7
5    20
6    38
7    18
dtype: int32

In [3]:
seri.size

8

In [4]:
seri.shape

(8,)

In [5]:
seri.ndim

1

In [6]:
seri.head()  #The default value is 5. It shows the first 5 values from the beginning.

0    38
1    28
2    14
3    42
4     7
dtype: int32

In [7]:
seri.head(3)  #We requested the first 3 values from the beginning.

0    38
1    28
2    14
dtype: int32

In [8]:
seri.tail()  #It returns the last 5 rows by default.

3    42
4     7
5    20
6    38
7    18
dtype: int32

In [None]:
seri.tail(2) #We requested to fetch the last 2 rows.

In [9]:
seri.sample()  #It returns a random row.

0    38
dtype: int32
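
Because `sample()` is random, its result changes from run to run. A small sketch of how to request several elements and make the draw repeatable, assuming the same values as `seri`; the `random_state` value 0 is an arbitrary choice.

```python
import pandas as pd

s = pd.Series([38, 28, 14, 42, 7, 20, 38, 18])

# The same random_state gives the same rows on every call
first = s.sample(n=3, random_state=0)
second = s.sample(n=3, random_state=0)

print(len(first))
print(first.equals(second))
```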

In [11]:
seri.sort_values() #sorted

4     7
2    14
7    18
5    20
1    28
0    38
6    38
3    42
dtype: int32

In [12]:
seri.sort_values(ascending=False)  #The default value of the ascending parameter is True. If it's set to False, it sorts in descending order from largest to smallest.

3    42
0    38
6    38
1    28
5    20
7    18
2    14
4     7
dtype: int32

In [13]:
seri.sort_index()

0    38
1    28
2    14
3    42
4     7
5    20
6    38
7    18
dtype: int32

In [14]:
seri.sort_index(ascending=False) #It sorted the indices in reverse order.

7    18
6    38
5    20
4     7
3    42
2    14
1    28
0    38
dtype: int32
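
One point worth checking is that the sorting methods return a new Series and leave the original untouched. A minimal sketch with a shortened, made-up Series:

```python
import pandas as pd

s = pd.Series([38, 28, 14], index=[0, 1, 2])

sorted_s = s.sort_values()        # new Series, sorted by value
restored = sorted_s.sort_index()  # sorting by index brings back the original order

print(s.tolist())         # the original is unchanged
print(sorted_s.tolist())
print(restored.equals(s))
```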

# <font color='green'> <b>DataFrames </b><font color='black'>

## <font color='blue'> <b>Creating a DataFrame</b><font color='black'>

DataFrame is a two-dimensional data collection.

It's a data structure where data is stored in a tabular format.

Data is organized into labeled rows and columns, and each column can hold its own data type.

We can think of a DataFrame as a collection of Series objects that share the same index.

We can perform various operations on a DataFrame, such as column/row selection and arithmetic (for example, addition).

We can load DataFrames from external storage such as a SQL database, a CSV file, or an Excel file.
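
The idea of a DataFrame as Series that share one index can be sketched as follows. The names `height`, `weight` and `df_demo` and their values are hypothetical.

```python
import pandas as pd

# Two hypothetical Series that share the index ["a", "b", "c"]
height = pd.Series([170, 160, 180], index=["a", "b", "c"])
weight = pd.Series([70, 55, 80], index=["a", "b", "c"])

df_demo = pd.DataFrame({"height": height, "weight": weight})

col = df_demo["height"]                         # each column is a Series
total = df_demo["height"] + df_demo["weight"]   # element-wise addition, aligned on the index

print(type(col).__name__)
print(total.tolist())
```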

[SOURCE01](https://www.tutorialspoint.com/python_pandas/python_pandas_dataframe.htm), 
[SOURCE02](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html), 
[SOURCE03](https://morioh.com/p/2528ac775b1b), 
[SOURCE04](https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python), 
[SOURCE05](https://www.guru99.com/python-pandas-tutorial.html), 
[SOURCE06](https://realpython.com/pandas-dataframe/) &
[SOURCE07](https://towardsdatascience.com/a-simple-guide-to-pandas-dataframes-b125f64e1453)<br>
[VIDEO SOURCE01](https://www.youtube.com/watch?v=zmdjNSmRXF4), 
[VIDEO SOURCE02](https://www.youtube.com/watch?v=F6kmIpWWEdU) &
[VIDEO SOURCE03](https://towardsdatascience.com/pandas-dataframe-basics-3c16eb35c4f3)<br>

### <font color='blue'> <b>Creating a DataFrame Using the Lists of Data & Columns</b><font color='black'>

In [3]:
list_1 = [[1,2,3],[4,5,6]]

column_name = ["A", "B", "C"]

df = pd.DataFrame(data = list_1, columns= column_name)
df

Unnamed: 0,A,B,C
0,1,2,3
1,4,5,6


### <font color='blue'> <b>Creating a DataFrame Using NumPy Arrays</b><font color='black'>

In [9]:
arr1 = np.arange(1, 27, 3).reshape(3,3)
arr1

array([[ 1,  4,  7],
       [10, 13, 16],
       [19, 22, 25]])

In [10]:
arr1.ndim

2

In [11]:
df = pd.DataFrame(arr1)
df

Unnamed: 0,0,1,2
0,1,4,7
1,10,13,16
2,19,22,25


In [16]:
arr2 = np.arange(1,36,4).reshape(3,3)
arr2

array([[ 1,  5,  9],
       [13, 17, 21],
       [25, 29, 33]])

In [22]:
df2 = pd.DataFrame(arr2, columns= range(1,4))
df2

Unnamed: 0,1,2,3
0,1,5,9
1,13,17,21
2,25,29,33


In [None]:
df = pd.DataFrame(arr1, columns= range(1,4)) #Column labels can also be generated with range(); here they start at 1 instead of 0.
df

In [15]:
df = pd.DataFrame(data = arr1, columns= ["A1", "B2", "A3"])
df

Unnamed: 0,A1,B2,A3
0,1,4,7
1,10,13,16
2,19,22,25


In [18]:
df = pd.DataFrame(arr1, columns=["A1", "B2", "A3"], index = ["B1", "B2", "B3"])
df

Unnamed: 0,A1,B2,A3
B1,1,4,7
B2,10,13,16
B3,19,22,25


### <font color='blue'> <b>Creating a DataFrame Using a Dictionary</b><font color='black'>

In [23]:
data = {'Name':['Ayşe', 'Ahmet', 'Mehmet'],'Age':[20,30,40]}


In [24]:
pd.Series(data)

Name    [Ayşe, Ahmet, Mehmet]
Age              [20, 30, 40]
dtype: object

In [25]:
pd.DataFrame(data = data) #parameter = argument   #In a Series, the dictionary keys become the index, whereas in a DataFrame they become the columns.

Unnamed: 0,Name,Age
0,Ayşe,20
1,Ahmet,30
2,Mehmet,40


In [24]:
pd.DataFrame(data, columns= ["Name", "Age", "Job"]) #The dictionary has no "Job" key, so that column is filled with NaN.

Unnamed: 0,Name,Age,Job
0,Ayşe,20,
1,Ahmet,30,
2,Mehmet,40,


In [25]:
pd.DataFrame(data, columns= ["aa", "bb", "cc"])

Unnamed: 0,aa,bb,cc


In [26]:
pd.DataFrame(data, columns= ["Name", "bb", "cc"])

Unnamed: 0,Name,bb,cc
0,Ayşe,,
1,Ahmet,,
2,Mehmet,,
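
The cells above suggest that, with a dictionary, `columns=` selects and orders the keys rather than renaming them. A short sketch to confirm this, where the column order is deliberately changed and `df_demo` is a hypothetical name:

```python
import pandas as pd

data = {"Name": ["Ayşe", "Ahmet", "Mehmet"], "Age": [20, 30, 40]}

# "Job" is not a key of the dictionary, so that column holds only NaN
df_demo = pd.DataFrame(data, columns=["Age", "Name", "Job"])

print(df_demo.columns.tolist())
print(df_demo["Job"].isna().all())
```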


## <font color='blue'> <b>Basic Attributes & Methods of DataFrames</b><font color='black'>

In [26]:
dict_1 = {'Name':['Ayşe', 'Ahmet', 'Mehmet'],'Age':[20,30,40]}

In [27]:
df = pd.DataFrame(data = dict_1)
df

Unnamed: 0,Name,Age
0,Ayşe,20
1,Ahmet,30
2,Mehmet,40


In [31]:
df.head(2)

Unnamed: 0,Name,Age
0,Ayşe,20
1,Ahmet,30


In [33]:
df.tail(2)

Unnamed: 0,Name,Age
1,Ahmet,30
2,Mehmet,40


In [37]:
df.sample()

Unnamed: 0,Name,Age
2,Mehmet,40


In [38]:
df.columns

Index(['Name', 'Age'], dtype='object')

In [40]:
df.columns[0]

'Name'

In [41]:
df.columns[1]

'Age'

In [42]:
df.columns[2]

IndexError: index 2 is out of bounds for axis 0 with size 2

In [39]:
df.index

RangeIndex(start=0, stop=3, step=1)

In [10]:
df

Unnamed: 0,Name,Age
0,Ayşe,20
1,Ahmet,30
2,Mehmet,40


In [9]:
df.mean  #Without parentheses the method is not called; pandas only displays the bound method.

<bound method DataFrame.mean of      Name  Age
0    Ayşe   20
1   Ahmet   30
2  Mehmet   40>

In [13]:
df.Age.mean()

30.0
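
To average all numeric columns at once instead of picking one column, `mean()` accepts `numeric_only=True`. This is a minimal sketch; depending on the pandas version, calling `mean()` without it on a frame that contains text columns may raise an error.

```python
import pandas as pd

df_demo = pd.DataFrame({"Name": ["Ayşe", "Ahmet", "Mehmet"], "Age": [20, 30, 40]})

method = df_demo.mean                     # no parentheses: the method object itself
means = df_demo.mean(numeric_only=True)   # called: a Series of column means

print(callable(method))
print(means["Age"])
```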

## <font color='blue'> <b>Indexing, Slicing & Selection</b><font color='black'>

[Source01](https://pandas.pydata.org/docs/user_guide/indexing.html),
[Source02](https://www.geeksforgeeks.org/slicing-indexing-manipulating-and-cleaning-pandas-dataframe/),
[Source03](https://www.tutorialspoint.com/python_pandas/python_pandas_indexing_and_selecting_data.htm),
[Source04](https://www.dataquest.io/blog/tutorial-indexing-dataframes-in-pandas/)

In [29]:
#The column names are Turkish: isim = name, boy-cm = height (cm), kilo-kg = weight (kg).
data={"isim":["Ali", "Ayşe", "Fatma", "Veli"], "boy-cm":[170,160,170,180], "kilo-kg":[70, 55, 60, 80]}
data

{'isim': ['Ali', 'Ayşe', 'Fatma', 'Veli'],
 'boy-cm': [170, 160, 170, 180],
 'kilo-kg': [70, 55, 60, 80]}

In [30]:
df = pd.DataFrame(data, index= ["A", "B", "C", "D"] )
df

Unnamed: 0,isim,boy-cm,kilo-kg
A,Ali,170,70
B,Ayşe,160,55
C,Fatma,170,60
D,Veli,180,80


In [18]:
df.shape

(4, 3)

In [19]:
df.isim

A      Ali
B     Ayşe
C    Fatma
D     Veli
Name: isim, dtype: object

In [22]:
df["isim"]

A      Ali
B     Ayşe
C    Fatma
D     Veli
Name: isim, dtype: object

In [23]:
df.columns

Index(['isim', 'boy-cm', 'kilo-kg'], dtype='object')

In [26]:
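df.kilo_kg  #Attribute access fails: the column is named "kilo-kg", and a hyphen cannot be part of an attribute name.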

AttributeError: 'DataFrame' object has no attribute 'kilo_kg'
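
Attribute access only works when the column name is a valid Python identifier. A sketch of the two usual ways around this, where `kilo_kg` is a hypothetical new name chosen for illustration:

```python
import pandas as pd

df_demo = pd.DataFrame({"isim": ["Ali", "Ayşe"], "kilo-kg": [70, 55]})

by_bracket = df_demo["kilo-kg"]   # bracket access works for any column name

# After renaming to a valid identifier, attribute access works too
renamed = df_demo.rename(columns={"kilo-kg": "kilo_kg"})
by_attribute = renamed.kilo_kg

print(by_bracket.tolist())
print(by_attribute.tolist())
```

`rename()` returns a new DataFrame, so `df_demo` keeps its original column name.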

In [29]:
df["kilo-kg"]

A    70
B    55
C    60
D    80
Name: kilo-kg, dtype: int64

In [30]:
a = df["kilo-kg"]
type(a)

pandas.core.series.Series

In [31]:
df[["kilo-kg"]]

Unnamed: 0,kilo-kg
A,70
B,55
C,60
D,80


In [32]:
b = df[["kilo-kg"]]
type(b)

pandas.core.frame.DataFrame

In [33]:
df[["boy-cm", "kilo-kg"]]

Unnamed: 0,boy-cm,kilo-kg
A,170,70
B,160,55
C,170,60
D,180,80


In [None]:
df["A"] #It raises a KeyError: a single label inside df[...] is looked up among the columns, and there is no column named "A".

In [37]:
df

Unnamed: 0,isim,boy-cm,kilo-kg
A,Ali,170,70
B,Ayşe,160,55
C,Fatma,170,60
D,Veli,180,80


In [36]:
df["A": "C"] #A slice of labels selects rows, from A to C with the end label included.

Unnamed: 0,isim,boy-cm,kilo-kg
A,Ali,170,70
B,Ayşe,160,55
C,Fatma,170,60


In [38]:
df[0:1]  #A slice of integers selects rows by position; the end position is excluded.

Unnamed: 0,isim,boy-cm,kilo-kg
A,Ali,170,70
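
For row selection, the explicit accessors `.loc` (labels) and `.iloc` (positions) avoid the ambiguity of plain brackets. A minimal sketch on a reduced copy of the table above:

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"isim": ["Ali", "Ayşe", "Fatma", "Veli"], "boy-cm": [170, 160, 170, 180]},
    index=["A", "B", "C", "D"],
)

row_by_label = df_demo.loc["A"]      # one row, returned as a Series
label_slice = df_demo.loc["A":"C"]   # label slice: the end label is included
position_slice = df_demo.iloc[0:2]   # position slice: the end position is excluded

print(row_by_label["isim"])
print(label_slice.index.tolist())
print(position_slice.index.tolist())
```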


## <font color='blue'> <b>Creating a New Column</b><font color='black'>

In [31]:
df

Unnamed: 0,isim,boy-cm,kilo-kg
A,Ali,170,70
B,Ayşe,160,55
C,Fatma,170,60
D,Veli,180,80


In [32]:
df["BMI"] = round(df["kilo-kg"] / (df["boy-cm"] / 100) **2, 2)

In [33]:
df

Unnamed: 0,isim,boy-cm,kilo-kg,BMI
A,Ali,170,70,24.22
B,Ayşe,160,55,21.48
C,Fatma,170,60,20.76
D,Veli,180,80,24.69


In [34]:
df["new"] = np.arange(4)

In [35]:
df

Unnamed: 0,isim,boy-cm,kilo-kg,BMI,new
A,Ali,170,70,24.22,0
B,Ayşe,160,55,21.48,1
C,Fatma,170,60,20.76,2
D,Veli,180,80,24.69,3
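
Assigning with `df["BMI"] = ...` changes the DataFrame in place. As an alternative sketch, `assign()` returns a new DataFrame and leaves the original as it was; `df_demo` and `with_bmi` are hypothetical names.

```python
import pandas as pd

df_demo = pd.DataFrame({"boy-cm": [170, 160], "kilo-kg": [70, 55]}, index=["A", "B"])

# Same BMI formula as above: weight (kg) divided by height (m) squared
bmi = (df_demo["kilo-kg"] / (df_demo["boy-cm"] / 100) ** 2).round(2)
with_bmi = df_demo.assign(BMI=bmi)

print(with_bmi["BMI"].tolist())
print("BMI" in df_demo.columns)   # the original has no BMI column
```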


In [9]:
df = pd.read_csv("adult_eda.csv")
df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12.0,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,HS-grad,9.0,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,HS-grad,9.0,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,HS-grad,9.0,Never-married,Adm-clerical,,White,Male,0,0,20,United-States,<=50K


In [10]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [11]:
df.tail()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
32556,27,Private,257302,Assoc-acdm,12.0,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,HS-grad,9.0,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,HS-grad,9.0,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,HS-grad,9.0,Never-married,Adm-clerical,,White,Male,0,0,20,United-States,<=50K
32560,52,Self-emp-inc,287927,HS-grad,9.0,Married-civ-spouse,Exec-managerial,Wife,White,Female,15024,0,40,United-States,>50K


In [12]:
df.dtypes  #dtypes is used in DataFrames, while dtype is used in Series.

age                 int64
workclass          object
fnlwgt              int64
education          object
education-num     float64
marital-status     object
occupation         object
relationship       object
race               object
sex                object
capital-gain        int64
capital-loss        int64
hours-per-week      int64
native-country     object
salary             object
dtype: object

In [13]:
df.size  #The product of the number of rows and columns.

488415

In [15]:
df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12.0,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,HS-grad,9.0,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,HS-grad,9.0,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,HS-grad,9.0,Never-married,Adm-clerical,,White,Male,0,0,20,United-States,<=50K


In [16]:
df.shape

(32561, 15)

In [18]:
df.ndim  #The number of axes (dimensions). It is 2 for every DataFrame, regardless of how many rows and columns it has.

2
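
The relationship between `size`, `shape` and `ndim` can be checked on a small frame. This is a sketch with a made-up 4 x 3 DataFrame of zeros:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(np.zeros((4, 3)))

rows, cols = df_demo.shape   # (number of rows, number of columns)

print(df_demo.size == rows * cols)   # size is rows times columns
print(df_demo.ndim)                  # always 2 for a DataFrame
```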

In [19]:
df.sample()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
8105,32,Self-emp-inc,113543,Prof-school,15.0,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,40,United-States,>50K


In [20]:
df.sample(20)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
11071,58,Private,430005,10th,6.0,Married-civ-spouse,Transport-moving,Husband,White,Male,0,0,50,United-States,>50K
29742,46,Private,270565,Assoc-voc,11.0,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,60,United-States,<=50K
14948,39,Self-emp-not-inc,336793,Masters,14.0,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,55,United-States,>50K
20799,40,Private,96509,Some-college,10.0,Married-civ-spouse,Prof-specialty,Husband,Amer-Indian-Eskimo,Male,0,0,40,United-States,>50K
25933,38,Private,102945,Bachelors,13.0,Married-civ-spouse,Sales,Husband,White,Male,0,0,52,United-States,>50K
13442,20,Private,131611,Some-college,10.0,Never-married,Adm-clerical,,White,Male,0,0,48,United-States,<=50K
21494,28,Self-emp-not-inc,282398,HS-grad,9.0,Divorced,Craft-repair,Not-in-family,White,Male,0,0,35,United-States,<=50K
20508,45,Private,116163,HS-grad,9.0,Separated,Exec-managerial,Not-in-family,White,Female,0,0,40,United-States,<=50K
28225,39,Private,249720,Bachelors,13.0,Divorced,Exec-managerial,Unmarried,Black,Female,0,0,60,United-States,<=50K
11493,37,Private,267085,Some-college,10.0,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,40,United-States,<=50K


In [22]:
df.isnull()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
32557,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
32558,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
32559,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False


In [24]:
df.isnull().sum()

age                  0
workclass            0
fnlwgt               0
education            0
education-num      802
marital-status       0
occupation           0
relationship      5068
race                 0
sex                  0
capital-gain         0
capital-loss         0
hours-per-week       0
native-country       0
salary               0
dtype: int64
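
`isnull().sum()` works because each `True` counts as 1 when summed. In the same way, `isnull().mean()` gives the share of missing values per column. A minimal sketch on a made-up miniature frame (the values are not taken from `adult_eda.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with missing values in both columns
df_demo = pd.DataFrame(
    {
        "education-num": [13.0, np.nan, 9.0, 7.0],
        "relationship": ["Husband", None, None, "Wife"],
    }
)

missing_count = df_demo.isnull().sum()    # number of missing values per column
missing_share = df_demo.isnull().mean()   # fraction of missing values per column

print(missing_count.tolist())
print(missing_share.tolist())
```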

In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             32561 non-null  int64  
 1   workclass       32561 non-null  object 
 2   fnlwgt          32561 non-null  int64  
 3   education       32561 non-null  object 
 4   education-num   31759 non-null  float64
 5   marital-status  32561 non-null  object 
 6   occupation      32561 non-null  object 
 7   relationship    27493 non-null  object 
 8   race            32561 non-null  object 
 9   sex             32561 non-null  object 
 10  capital-gain    32561 non-null  int64  
 11  capital-loss    32561 non-null  int64  
 12  hours-per-week  32561 non-null  int64  
 13  native-country  32561 non-null  object 
 14  salary          32561 non-null  object 
dtypes: float64(1), int64(5), object(9)
memory usage: 3.7+ MB
