<img src=https://i.ibb.co/6gCsHd6/1200px-Pandas-logo-svg.png width="700" height="200">

## <p style="background-color:#FDFEFE; font-family:newtimeroman; color:#060108; font-size:200%; text-align:center; border-radius:10px 10px;">Data Analysis with Python</p>

## <p style="background-color:#FDFEFE; font-family:newtimeroman; color:#4d77cf; font-size:200%; text-align:center; border-radius:10px 10px;">Working with Text & Time Data</p>

<a id="toc"></a>

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Content</p>

* [WORKING WITH TEXT DATA](#0)
* [IMPORTING LIBRARIES NEEDED IN THIS NOTEBOOK](#00)
* [WORKING WITH TEXT DATA](#1)
    * [String Methods](#1.1)
    * [Most Useful String Methods](#1.2)
    * [Dummy Operations](#1.3)
* [WORKING WITH TIME DATA](#2)
    * [pd.to_datetime()](#2.1)
    * [Series.dt()](#2.2)
    * [Datetime Module](#2.3)
    * [Series.dt()](#2.4)
* [OPERATION WITH DATETIME OBJECT](#3)
* [THE END OF THE SESSION](#4)

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:center; border-radius:10px 10px;">Importing Libraries Needed in This Notebook</p>

<a id="00"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Working with Text Data</p>

<a id="1"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

In this notebook, we will first discuss the string operations with our basic Series/Index and learn how to apply these string functions on the DataFrame.

Pandas provides a set of string functions that make it easy to operate on string data. Most importantly, these functions ignore (or exclude) missing/NaN values. Almost all of these methods mirror Python's built-in string methods ([see the official Python documentation](https://docs.python.org/3/library/stdtypes.html#string-methods)). So, when working with a Series, access them through the `.str` accessor (converting the column to a string dtype first if necessary) and then perform the operation.

In addition, according to [Pandas Official Document](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html), there are two ways to store text data in pandas:
- `object`-dtype NumPy array.
- StringDtype extension type.

Pandas recommends using StringDtype to store text data.
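As a small illustration of the two storage options, the sketch below builds a toy Series (the values are made up, not taken from the exercise file) with `object` dtype and converts it with `astype("string")`. It assumes a pandas version that ships `StringDtype` (1.0 or later).

```python
import pandas as pd

# Toy data; the dtype is set explicitly because the inferred default
# can differ between pandas versions
names = pd.Series(["Tom", "Ann", None], dtype="object")

# Convert to the dedicated string extension type
as_string = names.astype("string")

print(names.dtype)
print(as_string.dtype)
print(as_string.str.upper())  # the missing value stays missing
```

The `.str` methods work on both variants; the extension type mainly makes the intent explicit and keeps non-string values out of the column.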

[SOURCE01](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html), [SOURCE02](https://www.w3schools.com/python/python_ref_string.asp)

### <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:LEFT; border-radius:10px 10px;">String Methods</p>

<a id="1.1"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

Strings implement all of the common sequence operations, along with the additional methods described at [the official documentation](https://docs.python.org/3/library/stdtypes.html#string-methods).

Strings also support two styles of string formatting, one providing a large degree of flexibility and customization (**Please see the information about** [str.format()](https://docs.python.org/3/library/stdtypes.html#str.format), [Format String Syntax](https://docs.python.org/3/library/string.html#formatstrings) and [Custom String Formatting](https://docs.python.org/3/library/string.html#string-formatting)) and the other based on C printf style formatting that handles a narrower range of types and is slightly harder to use correctly, but is often faster for the cases it can handle ([printf-style String Formatting](https://docs.python.org/3/library/stdtypes.html#old-string-formatting)).

The [Text Processing Services](https://docs.python.org/3/library/text.html#textservices) section of the standard library covers a number of other modules that provide various text related utilities (including regular expression support in the [re](https://docs.python.org/3/library/re.html#module-re) module).

Please watch [**``Video Source``**](https://www.youtube.com/watch?v=6JNwK6hEneg) for enhancing your understanding of working with Text Data in Pandas.  

**What are these String Methods? Now let us examine some of the most common and useful String Methods and dig into them one by one:**

![image.png](attachment:f2458b69-54fd-45f2-b36d-00f6a63df843.png)

In [None]:
# pip install openpyxl
# without Anaconda, we could install this package to be able to open Excel files

In [2]:
df0 = pd.read_excel("text_exercise.xlsx")
# the file is an Excel workbook (.xlsx)
df=df0.copy()
df

Unnamed: 0,id,staff,department,job,salary,age
0,M0001,Tom BLUE,HR,manager,"""$150,000""",52
1,M0002,JOHN BLACK,IT,manager,"""$180,000""",48
2,E0001,Micheal Brown,IT,data scientist,"""$150,000""",35
3,E0002,jason walker,HR,recruiter,130000dolar,38
4,E0003,Alex Green,IT,backend developer,"""$110,000""",-
5,E0004,OSCAR SMİTH,IT,frontend developer,"""$120,000""",32
6,E0005,Adrian STAR,IT,data scientist,"""$135,000""",40
7,E0006,Albert simon,IT,data scientist,125000dolar,35


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          8 non-null      object
 1   staff       8 non-null      object
 2   department  8 non-null      object
 3   job         8 non-null      object
 4   salary      8 non-null      object
 5   age         8 non-null      object
dtypes: object(6)
memory usage: 512.0+ bytes


### <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:LEFT; border-radius:10px 10px;">Most Useful String Methods</p>

<a id="1.2"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

- **str.lower() =>** Converts a string into lower case
- **str.upper() =>** Converts a string into upper case
- **str.capitalize() =>** Converts the first character to upper case
- **str.title() =>** Converts the first character of each word to upper case
- **str.swapcase() =>** Swaps the case lower/upper

[SOURCE01](https://www.tutorialspoint.com/python_pandas/python_pandas_working_with_text_data.htm)
[SOURCE02](https://www.aboutdatablog.com/post/10-most-useful-string-functions-in-pandas)
[SOURCE03](https://towardsdatascience.com/5-must-know-pandas-operations-on-strings-4f88ca6b8e25)
[SOURCE04](https://towardsdatascience.com/pandas-string-operations-explained-fdfab7602fb4)
[SOURCE05](https://blog.devgenius.io/string-operations-on-pandas-dataframe-88af220439d1)
[SOURCE06](https://www.geeksforgeeks.org/string-manipulations-in-pandas-dataframe/)

In [3]:
df.staff.str.lower()
# Without .str this raises an error. The .str accessor lets lower() run element-wise over the whole Series.

0         tom blue
1       john black
2    micheal brown
3     jason walker
4       alex green
5     oscar smi̇th
6      adrian star
7     albert simon
Name: staff, dtype: object

In [6]:
df["staff"].str.upper()

0         TOM BLUE
1       JOHN BLACK
2    MICHEAL BROWN
3     JASON WALKER
4       ALEX GREEN
5      OSCAR SMİTH
6      ADRIAN STAR
7     ALBERT SIMON
Name: staff, dtype: object

In [7]:
df["staff"].str.title()

0         Tom Blue
1       John Black
2    Micheal Brown
3     Jason Walker
4       Alex Green
5     Oscar Smi̇th
6      Adrian Star
7     Albert Simon
Name: staff, dtype: object

___

In [8]:
df["staff"].str.capitalize()

0         Tom blue
1       John black
2    Micheal brown
3     Jason walker
4       Alex green
5     Oscar smi̇th
6      Adrian star
7     Albert simon
Name: staff, dtype: object

In [9]:
df["staff"].str.swapcase()

0         tOM blue
1       john black
2    mICHEAL bROWN
3     JASON WALKER
4       aLEX gREEN
5     oscar smi̇th
6      aDRIAN star
7     aLBERT SIMON
Name: staff, dtype: object

In [4]:
# what if one of the strings consists of digits?
arr = np.array(["ali", "veli", "20", "deli"])
dframe = pd.DataFrame(arr, columns= ["name"])
dframe

Unnamed: 0,name
0,ali
1,veli
2,20
3,deli


In [5]:
dframe.name.str.swapcase()
# numeric strings are left unchanged and no error is raised, because they are str as well

0     ALI
1    VELI
2      20
3    DELI
Name: name, dtype: object

___

# Methods That Return BOOLEAN Results

- **str.isalpha()     =>** Returns True if all characters in the string are alphabetic
- **str.isnumeric()   =>** Returns True if all characters in the string are numeric
- **str.isalnum()     =>** Returns True if all characters in the string are alphanumeric
- **str.endswith()    =>** Returns True if the string ends with the specified value
- **str.startswith()  =>** Returns True if the string starts with the specified value
- **str.contains()    =>** Returns True for each element that contains the given substring or pattern, else False

[SOURCE01](https://careerkarma.com/blog/python-isalpha-isnumeric-isalnum/)
[SOURCE02](https://careerkarma.com/blog/python-startswith-and-endswith/)
[SOURCE03](https://www.geeksforgeeks.org/python-startswith-endswidth-function/)
[SOURCE04](https://towardsdatascience.com/check-for-a-substring-in-a-pandas-dataframe-column-4b949f64852#:~:text=The%20contains%20method%20in%20Pandas,str.)

___

In [14]:
df

Unnamed: 0,id,staff,department,job,salary,age
0,M0001,Tom BLUE,HR,manager,"""$150,000""",52
1,M0002,JOHN BLACK,IT,manager,"""$180,000""",48
2,E0001,Micheal Brown,IT,data scientist,"""$150,000""",35
3,E0002,jason walker,HR,recruiter,130000dolar,38
4,E0003,Alex Green,IT,backend developer,"""$110,000""",-
5,E0004,OSCAR SMİTH,IT,frontend developer,"""$120,000""",32
6,E0005,Adrian STAR,IT,data scientist,"""$135,000""",40
7,E0006,Albert simon,IT,data scientist,125000dolar,35


**isalpha()** checks whether each string consists of alphabetic characters only. It returns True when every character is alphabetic (and the string is not empty) and False otherwise, for example when the string contains a space or a digit.

In [6]:
df.job.str.isalpha()
# the False rows contain a space

0     True
1     True
2    False
3     True
4    False
5    False
6    False
7    False
Name: job, dtype: bool

In [8]:
# how can we avoid the spaces:
df.job.str.replace(" ", "").str.isalpha()

0    True
1    True
2    True
3    True
4    True
5    True
6    True
7    True
Name: job, dtype: bool

**isnumeric()** checks whether all characters in each string are numeric. This is equivalent to running the Python string method str.isnumeric() for each element of the Series/Index.

In [17]:
df.age.str.isnumeric()
# Row 4 is False because of the "-".
# The other rows are NaN because the object column holds non-string values (integers) there. We convert the column below.

0      NaN
1      NaN
2      NaN
3      NaN
4    False
5      NaN
6      NaN
7      NaN
Name: age, dtype: object

In [18]:
"10".isnumeric()
# isnumeric is a str method applied to characters, so it looks at strings, not at int or float values

True

In [19]:
"10a".isnumeric()
# in other words: it checks whether EVERY character of the string is numeric

False

In [20]:
df.age.astype("string").str.isnumeric()
# so for some methods an object column has to be converted to string first

0     True
1     True
2     True
3     True
4    False
5     True
6     True
7     True
Name: age, dtype: boolean
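The NaN results above come from non-string values inside an `object` column. A minimal sketch with a made-up Series (an assumption, not the exercise file) that mixes integers and a string shows the difference before and after the conversion:

```python
import pandas as pd

# Hypothetical mixed column: two integers and one string
ages = pd.Series([52, 48, "-"], dtype="object")

# .str methods return NaN for elements that are not strings
raw = ages.str.isnumeric()

# After converting every element to a string, each one is checked
converted = ages.astype("string").str.isnumeric()

print(raw.tolist())
print(converted.tolist())  # [True, True, False]
```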

**isalnum()** checks whether each string consists of alphanumeric characters only. It returns True when every character is a letter or a number (and the string is not empty) and False otherwise. Alphanumeric means a character that is either a letter or a number.

In [21]:
df.salary.str.isalnum()

0    False
1    False
2    False
3     True
4    False
5    False
6    False
7     True
Name: salary, dtype: bool

In [22]:
df.salary.head(4)
# row 3 holds only digits and letters; the other rows return False because of the quotes, dollar signs and commas

0     "$150,000"
1     "$180,000"
2     "$150,000"
3    130000dolar
Name: salary, dtype: object
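Both checks apply to every character of a string. A few made-up values (assumptions, not rows from the file) make this visible:

```python
import pandas as pd

samples = pd.Series(["data", "data scientist", "130000dolar", "$150,000", ""])

alpha = samples.str.isalpha()  # True only if every character is a letter
alnum = samples.str.isalnum()  # True only if every character is a letter or digit

print(pd.DataFrame({"value": samples, "isalpha": alpha, "isalnum": alnum}))
```

Note that a single space or punctuation mark is enough to turn the result into False, and that the empty string is False for both methods.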

Pandas **startswith()** tests whether the start of each string element matches a pattern. It is yet another way to search and filter text data in a Series or DataFrame. It is similar to Python's startswith() method, but it has different parameters and works on pandas objects only. Hence the call has to go through the `.str` accessor, so that pandas applies it to each element rather than to the Series object itself.

In [138]:
df.job.str.startswith("d")

0    False
1    False
2     True
3    False
4    False
5    False
6     True
7     True
Name: job, dtype: bool

In [26]:
df[df.job.str.startswith("d")]

Unnamed: 0,id,staff,department,job,salary,age
2,E0001,Micheal Brown,IT,data scientist,"""$150,000""",35
6,E0005,Adrian STAR,IT,data scientist,"""$135,000""",40
7,E0006,Albert simon,IT,data scientist,125000dolar,35


Pandas **endswith()** tests whether the end of each string element matches a given sequence of characters.

In [27]:
df.job.str.endswith("per")

0    False
1    False
2    False
3    False
4     True
5     True
6    False
7    False
Name: job, dtype: bool

In [28]:
df[df.job.str.endswith("per")]
# the inner expression is a boolean condition; only the rows where it is True are returned

Unnamed: 0,id,staff,department,job,salary,age
4,E0003,Alex Green,IT,backend developer,"""$110,000""",-
5,E0004,OSCAR SMİTH,IT,frontend developer,"""$120,000""",32


In [29]:
# only the job column
df[["job"]][df.job.str.endswith("per")]
# the inner expression is a boolean condition; only the rows where it is True are returned

Unnamed: 0,job
4,backend developer
5,frontend developer


In [30]:
# or with loc
df.loc[df.job.str.endswith("per"), ["job"]]
# the first part selects the rows, ["job"] selects the column

Unnamed: 0,job
4,backend developer
5,frontend developer


The **contains()** method in Pandas allows you to search a column for a specific substring. It returns a boolean Series that is True where the original value contains the substring and False where it does not [SOURCE](https://towardsdatascience.com/check-for-a-substring-in-a-pandas-dataframe-column-4b949f64852#:~:text=The%20contains%20method%20in%20Pandas,str.).

In [31]:
df.job.str.contains("data")

0    False
1    False
2     True
3    False
4    False
5    False
6     True
7     True
Name: job, dtype: bool

In [32]:
df.loc[df.job.str.contains("data"), ["job"]]

Unnamed: 0,job
2,data scientist
6,data scientist
7,data scientist


In [33]:
df.loc[df.job.str.contains("data")]
# without a column selection we get all columns of the rows that match the condition

Unnamed: 0,id,staff,department,job,salary,age
2,E0001,Micheal Brown,IT,data scientist,"""$150,000""",35
6,E0005,Adrian STAR,IT,data scientist,"""$135,000""",40
7,E0006,Albert simon,IT,data scientist,125000dolar,35


In [34]:
# querying with a regex
df.salary.str.contains("[a-z]+")
# True if one or more (+) lowercase letters from a to z occur in the string

0    False
1    False
2    False
3     True
4    False
5    False
6    False
7     True
Name: salary, dtype: bool
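Because `str.contains` interprets its pattern as a regular expression by default, characters such as `$` need care. The following sketch uses a made-up Series (an assumption, including the missing value) to contrast a regex search with a literal search and to show the `na` parameter:

```python
import pandas as pd

# Hypothetical salary strings with one missing value
salary = pd.Series(['"$150,000"', "130000dolar", None], dtype="object")

# Regex search: one or more lowercase letters
has_letters = salary.str.contains("[a-z]+")

# As a regex, "$" is an end-of-string anchor, so search for it literally
has_dollar = salary.str.contains("$", regex=False)

# na=False turns the missing result into False, which is handy for filtering
has_dollar_filled = salary.str.contains("$", regex=False, na=False)

print(has_dollar_filled.tolist())  # [True, False, False]
```

A mask without missing values can be passed straight to `df[...]` or `df.loc[...]`.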

In [35]:
df[df.job.str.contains("data")]
# the same condition without loc; a similar example appears above

Unnamed: 0,id,staff,department,job,salary,age
2,E0001,Micheal Brown,IT,data scientist,"""$150,000""",35
6,E0005,Adrian STAR,IT,data scientist,"""$135,000""",40
7,E0006,Albert simon,IT,data scientist,125000dolar,35


In [37]:
df.loc[df.job.str.contains("data"), "salary"]

# selecting by row and column together requires loc
# this returns the salaries of the rows whose job contains "data"

2     "$150,000"
6     "$135,000"
7    125000dolar
Name: salary, dtype: object

In [9]:
df.salary

0     "$150,000"
1     "$180,000"
2     "$150,000"
3    130000dolar
4     "$110,000"
5     "$120,000"
6     "$135,000"
7    125000dolar
Name: salary, dtype: object

In [None]:
# let's set only the data scientists matching this condition to "DS"
# df.loc[df.job.str.contains("data"), "job"] = "DS"

In [140]:
df.loc[df.department=="IT", "department"] = "DS"
df

Unnamed: 0,id,staff,department,job,salary,age
0,M0001,Tom BLUE,HR,manager,"""$150,000""",52
1,M0002,JOHN BLACK,DS,manager,"""$180,000""",48
2,E0001,Micheal Brown,DS,data scientist,"""$150,000""",35
3,E0002,jason walker,HR,recruiter,130000dolar,38
4,E0003,Alex Green,DS,backend developer,"""$110,000""",-
5,E0004,OSCAR SMİTH,DS,frontend developer,"""$120,000""",32
6,E0005,Adrian STAR,DS,data scientist,"""$135,000""",40
7,E0006,Albert simon,DS,data scientist,125000dolar,35


In [141]:
# revert the change
df.loc[df.department == "DS", "department"] = "IT"
# without restricting the assignment to the "department" column, every column of the matching rows would be set to "IT"
df

Unnamed: 0,id,staff,department,job,salary,age
0,M0001,Tom BLUE,HR,manager,"""$150,000""",52
1,M0002,JOHN BLACK,IT,manager,"""$180,000""",48
2,E0001,Micheal Brown,IT,data scientist,"""$150,000""",35
3,E0002,jason walker,HR,recruiter,130000dolar,38
4,E0003,Alex Green,IT,backend developer,"""$110,000""",-
5,E0004,OSCAR SMİTH,IT,frontend developer,"""$120,000""",32
6,E0005,Adrian STAR,IT,data scientist,"""$135,000""",40
7,E0006,Albert simon,IT,data scientist,125000dolar,35


We can use the string methods that return boolean values to build conditions and select the matching rows.

___

- **str.strip()	=>** Returns a trimmed version of the string

- **str.replace() =>** Returns a string where a specified value is replaced with another specified value

- **str.split()	=>** Splits the string at the specified separator, and returns a list

- **str.find()	=>** Searches the string for a specified value and returns the position of where it was found

- **str.findall()	=>** Returns a list of all occurrence of the pattern.

- **str.join()	=>** Converts the elements of an iterable into a string

In [10]:
# work on the salary column
df.salary.str.strip("\"")
# Why the escape sequence? A double quote normally marks where a string starts and ends;
# the backslash turns it into a literal quote character.
# df.salary.str.strip('"') would work as well

0       $150,000
1       $180,000
2       $150,000
3    130000dolar
4       $110,000
5       $120,000
6       $135,000
7    125000dolar
Name: salary, dtype: object

In [42]:
# remove the trailing "dolar" text on the right
df.salary.str.strip("\"").str.rstrip("dolar")

0    $150,000
1    $180,000
2    $150,000
3      130000
4    $110,000
5    $120,000
6    $135,000
7      125000
Name: salary, dtype: object

In [43]:
# remove the dollar signs on the left
df.salary.str.strip("\"").str.rstrip("dolar").str.lstrip("$")

0    150,000
1    180,000
2    150,000
3     130000
4    110,000
5    120,000
6    135,000
7     125000
Name: salary, dtype: object

In [44]:
# instead of chaining several calls, the same can be done more briefly
df.salary.str.strip("\"dolar$")
# all characters to strip can be listed together in one argument

0    150,000
1    180,000
2    150,000
3     130000
4    110,000
5    120,000
6    135,000
7     125000
Name: salary, dtype: object
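One caveat worth illustrating: `strip`, `lstrip` and `rstrip` treat their argument as a set of characters, not as a word. The sketch below uses made-up values (an assumption) and contrasts `rstrip("dolar")` with `str.removesuffix`, which is available in pandas 1.4 and later:

```python
import pandas as pd

samples = pd.Series(["130000dolar", "road", "$150,000"])

# Removes any trailing run of the characters d, o, l, a, r
stripped = samples.str.rstrip("dolar")

# Removes the exact suffix "dolar" only
suffix_removed = samples.str.removesuffix("dolar")

print(stripped.tolist())        # ['130000', '', '$150,000']
print(suffix_removed.tolist())  # ['130000', 'road', '$150,000']
```

For the salary column both approaches give the same result, but on free text the character-set behaviour can remove more than intended.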

In [46]:
# strip does not remove the commas because they are inside the string; use replace for that
df.salary.str.strip("\"dolar$").replace(",","")
# Series.replace raised no error but did nothing either: it only matches values that are exactly ",".
# str.replace searches inside each string, so it finds the comma among the other characters.
# A second difference: str.replace can only swap a string for another string; str.replace(",", 9) raises an error,
# whereas Series.replace can do that.

0    150,000
1    180,000
2    150,000
3     130000
4    110,000
5    120,000
6    135,000
7     125000
Name: salary, dtype: object

In [45]:
# for that reason we use replace, not str.replace, to turn values such as dashes (-) into np.nan

In [47]:
df.salary.str.strip("\"dolar$").str.replace(",","")
# the commas are gone
# everything is clean now, but the dtype is still object; let's fix that too

0    150000
1    180000
2    150000
3    130000
4    110000
5    120000
6    135000
7    125000
Name: salary, dtype: object

In [142]:
df.salary = df.salary.str.strip('"$,dolar').str.replace(",","").astype(int)

In [143]:
df.salary
# the dtype is now int and the column is fully numeric

0    150000
1    180000
2    150000
3    130000
4    110000
5    120000
6    135000
7    125000
Name: salary, dtype: int64

___

**NOTE:** To use strip effectively, please review escape characters in Python: [Source01 for Escape Characters](https://www.python-ds.com/python-3-escape-sequences) & [Source02 for Escape Characters](https://www.w3schools.com/python/gloss_python_escape_characters.asp)

### ``str.replace()`` vs ``.replace()``

- **Purpose:** Use **str.replace** for substring replacements on a single string column, and **replace** for any general replacement on one or more columns.

- **Usage:** **str.replace** can replace one thing at a time. **replace** lets you perform multiple independent replacements, i.e., replace many things at once.

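- **Default behavior:** In pandas versions before 2.0, **str.replace** treated the pattern as a regex by default; since pandas 2.0 the default is regex=False, so pass regex=True explicitly when a pattern is intended. **replace** only performs a full match unless the regex=True switch is used.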
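The three points above can be sketched on a small made-up Series (the values and the replacement labels are assumptions for illustration):

```python
import pandas as pd

job = pd.Series(["data scientist", "manager", "data"])

# Series.replace matches whole values only
whole = job.replace("data", "DATA")

# str.replace works on substrings; regex=False makes the literal intent explicit
sub = job.str.replace("data", "DATA", regex=False)

# Series.replace accepts a mapping for several replacements at once
many = job.replace({"manager": "MGR", "data": "D"})

# With regex=True, Series.replace also replaces inside the strings
as_regex = job.replace("data", "DATA", regex=True)

print(whole.tolist())  # ['data scientist', 'manager', 'DATA']
print(sub.tolist())    # ['DATA scientist', 'manager', 'DATA']
```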

In [144]:
df.job.replace("data","DATA")
# Series.replace cannot do this because it matches whole values only, yet it still runs without an error

0               manager
1               manager
2        data scientist
3             recruiter
4     backend developer
5    frontend developer
6        data scientist
7        data scientist
Name: job, dtype: object

In [145]:
# str.replace, on the other hand:
df.job.str.replace("data","DATA")

0               manager
1               manager
2        DATA scientist
3             recruiter
4     backend developer
5    frontend developer
6        DATA scientist
7        DATA scientist
Name: job, dtype: object

In [54]:
df.age = df.age.replace("-", np.nan)

In [55]:
df.age

0    52.0
1    48.0
2    35.0
3    38.0
4     NaN
5    32.0
6    40.0
7    35.0
Name: age, dtype: float64
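An alternative worth knowing is `pd.to_numeric` with `errors="coerce"`, which converts everything it can and turns the rest into NaN in one step. The sketch below uses a made-up Series (an assumption) rather than the exercise column:

```python
import pandas as pd

# Hypothetical age column with a placeholder dash
age = pd.Series([52, 48, "-", 32], dtype="object")

# Values that cannot be parsed as numbers become NaN
numeric_age = pd.to_numeric(age, errors="coerce")

print(numeric_age)
```

This does not depend on knowing the placeholder in advance, which helps when a column contains several different non-numeric markers.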

In [56]:
df.info()
# two of our columns are now numeric

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   id          8 non-null      object 
 1   staff       8 non-null      object 
 2   department  8 non-null      object 
 3   job         8 non-null      object 
 4   salary      8 non-null      int64  
 5   age         7 non-null      float64
dtypes: float64(1), int64(1), object(4)
memory usage: 512.0+ bytes


In [57]:
# split the first and last names
df.staff.str.split(pat= " ")

0         [Tom, BLUE]
1       [JOHN, BLACK]
2    [Micheal, Brown]
3     [jason, walker]
4       [Alex, Green]
5      [OSCAR, SMİTH]
6      [Adrian, STAR]
7     [Albert, simon]
Name: staff, dtype: object

In [58]:
df.staff.str.title().str.split(pat= " ")

0         [Tom, Blue]
1       [John, Black]
2    [Micheal, Brown]
3     [Jason, Walker]
4       [Alex, Green]
5     [Oscar, Smi̇th]
6      [Adrian, Star]
7     [Albert, Simon]
Name: staff, dtype: object

In [59]:
# take only the first names; this can become a new column
df.staff.str.title().str.split(pat= " ").str[0]

0        Tom
1       John
2    Micheal
3      Jason
4       Alex
5      Oscar
6     Adrian
7     Albert
Name: staff, dtype: object

In [60]:
df.staff.str.title().str.split(pat= " ").str[1]

0      Blue
1     Black
2     Brown
3    Walker
4     Green
5    Smi̇th
6      Star
7     Simon
Name: staff, dtype: object

**Indexing with .str[]** 

You can use [] notation to directly index by position locations [SOURCE](https://pandas.pydata.org/pandas-docs/version/0.15/text.html). 

In [61]:
df["first_name"] = df.staff.str.title().str.split(pat= " ").str[0]
df["last_name"] = df.staff.str.title().str.split(pat= " ").str[1]

In [None]:
# the same operation with expand:
# df["first_name"] = df.staff.str.split(expand=True).iloc[:,0]
# df["last_name"] = df.staff.str.split(expand=True).iloc[:,1]
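As a sketch of the `expand=True` variant, the example below splits once from the right so that names with more than two words still yield exactly two columns. The sample names and the column labels are assumptions for illustration:

```python
import pandas as pd

staff = pd.Series(["Tom BLUE", "jason walker", "Mary Ann SMITH"])

# rsplit with n=1 splits at the last whitespace only;
# expand=True returns a DataFrame instead of a Series of lists
names = staff.str.title().str.rsplit(n=1, expand=True)
names.columns = ["first_name", "last_name"]

print(names)
```

Splitting from the right keeps middle names together with the first name, which is one possible convention among several.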

In [63]:
# the staff column can be dropped now
df.drop("staff", axis=1, inplace=True)

In [11]:
# split each name once; every word except the last forms the first name,
# so names with more than two words are handled as well
parts = df.staff.str.title().str.split()
df["first_name"] = parts.str[:-1].str.join(" ")
df["last_name"] = parts.str[-1]

**str.find** returns the lowest index in each string of the Series/Index at which the substring occurs. Each returned index is the position where the substring is fully contained within [start:end]. It returns -1 on failure. Equivalent to standard str.find().

**str.rfind** returns the highest index in each string of the Series/Index at which the substring occurs. Each returned index is the position where the substring is fully contained within [start:end]. It returns -1 on failure. Equivalent to standard str.rfind().

In [64]:
df.job.str.find("developer")
# -1 means the substring was not found. The 8 and 9 in rows 4 and 5 are the index at which "developer" starts

0   -1
1   -1
2   -1
3   -1
4    8
5    9
6   -1
7   -1
Name: job, dtype: int64

**str.findall** finds all occurrences of pattern or regular expression in the Series/Index [SOURCE](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.findall.html).

In [65]:
df.job.str.findall("d")
# returns a list with one entry per occurrence

0        []
1        []
2       [d]
3        []
4    [d, d]
5    [d, d]
6       [d]
7       [d]
Name: job, dtype: object

In [66]:
# count the occurrences with apply and len
df.job.str.findall("d").apply(len)

0    0
1    0
2    1
3    0
4    2
5    2
6    1
7    1
Name: job, dtype: int64
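When only the number of matches is needed, `str.count` gives the same result directly. A short sketch on made-up job titles (an assumption):

```python
import pandas as pd

job = pd.Series(["manager", "backend developer", "data scientist"])

# Length of each findall result list
via_findall = job.str.findall("d").str.len()

# Direct count of pattern occurrences
via_count = job.str.count("d")

print(via_count.tolist())  # [0, 2, 1]
```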

# Join()

In [12]:
df["skills"] = [[],["Java","C++"],["Python","Tableau","SQL"],[],["React","Django"],["JavaScript","Python"],["R","SQL"],["SQL","Python"]]
df["Skills"] = [[],[],["Python","Tableau","SQL"],[],["React","Django"],["JavaScript","Python"],["R","SQL"],["SQL","Python"]]
df.loc[1, "Skills"] = "Java,C++"
df

Unnamed: 0,id,staff,department,job,salary,age,first_name,last_name,skills,Skills
0,M0001,Tom BLUE,HR,manager,"""$150,000""",52,Tom,Blue,[],[]
1,M0002,JOHN BLACK,IT,manager,"""$180,000""",48,John,Black,"[Java, C++]","Java,C++"
2,E0001,Micheal Brown,IT,data scientist,"""$150,000""",35,Micheal,Brown,"[Python, Tableau, SQL]","[Python, Tableau, SQL]"
3,E0002,jason walker,HR,recruiter,130000dolar,38,Jason,Walker,[],[]
4,E0003,Alex Green,IT,backend developer,"""$110,000""",-,Alex,Green,"[React, Django]","[React, Django]"
5,E0004,OSCAR SMİTH,IT,frontend developer,"""$120,000""",32,Oscar,Smi̇th,"[JavaScript, Python]","[JavaScript, Python]"
6,E0005,Adrian STAR,IT,data scientist,"""$135,000""",40,Adrian,Star,"[R, SQL]","[R, SQL]"
7,E0006,Albert simon,IT,data scientist,125000dolar,35,Albert,Simon,"[SQL, Python]","[SQL, Python]"


In [13]:
# join the elements of each list with a comma into one string, which also removes the list brackets
df.skills.str.join(",")


0                      
1              Java,C++
2    Python,Tableau,SQL
3                      
4          React,Django
5     JavaScript,Python
6                 R,SQL
7            SQL,Python
Name: skills, dtype: object

In [69]:
df.skills.str.join(",")[2]

'Python,Tableau,SQL'

In [14]:
# apply the same to the other Skills column
df.Skills.str.join(",")
# Row 1 changed in a way we did not want: the value was a plain string, not a list, so it was split into characters.
# In the skills column every value was a list; here some values are not, and those are joined character by character.
# So the rows have to be checked individually, for example by looking at the type of each value.

0                      
1       J,a,v,a,,,C,+,+
2    Python,Tableau,SQL
3                      
4          React,Django
5     JavaScript,Python
6                 R,SQL
7            SQL,Python
Name: Skills, dtype: object

In [15]:
# Python's built-in str.join
",".join("galatasaray")

'g,a,l,a,t,a,s,a,r,a,y'

In [16]:
",".join(["galatasaray"])

'galatasaray'

In [17]:
",".join(["ali", "veli", "deli"])
# the items were taken out of the list and joined with commas

'ali,veli,deli'

In [18]:
# so for mixed cases we apply a function to the whole column with apply
df.Skills.apply(lambda x: ",".join(x) if type(x) == list else x)
# join if the value is a list, leave it untouched if it is a string

0                      
1              Java,C++
2    Python,Tableau,SQL
3                      
4          React,Django
5     JavaScript,Python
6                 R,SQL
7            SQL,Python
Name: Skills, dtype: object

In [74]:
# or with a list comprehension
[",".join(x) if type(x) == list else x for x in df.Skills]

['',
 'Java,C++',
 'Python,Tableau,SQL',
 '',
 'React,Django',
 'JavaScript,Python',
 'R,SQL',
 'SQL,Python']

In [19]:
df["Skills"] = [",".join(x) if type(x) == list else x for x in df.Skills]

If the elements of a Series are lists themselves, join the content of these lists using the delimiter passed to the function. This function is an equivalent to str.join() [SOURCE](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.join.html).

**Join** lists contained as elements in the Series/Index with passed delimiter.

In [77]:
df

Unnamed: 0,id,department,job,salary,age,first_name,last_name,skills,Skills
0,M0001,HR,manager,150000,52.0,Tom,Blue,[],
1,M0002,IT,manager,180000,48.0,John,Black,"[Java, C++]","Java,C++"
2,E0001,IT,data scientist,150000,35.0,Micheal,Brown,"[Python, Tableau, SQL]","Python,Tableau,SQL"
3,E0002,HR,recruiter,130000,38.0,Jason,Walker,[],
4,E0003,IT,backend developer,110000,,Alex,Green,"[React, Django]","React,Django"
5,E0004,IT,frontend developer,120000,32.0,Oscar,Smi̇th,"[JavaScript, Python]","JavaScript,Python"
6,E0005,IT,data scientist,135000,40.0,Adrian,Star,"[R, SQL]","R,SQL"
7,E0006,IT,data scientist,125000,35.0,Albert,Simon,"[SQL, Python]","SQL,Python"


### <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:LEFT; border-radius:10px 10px;">Dummy Operations</p>

<a id="1.3"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

A dataset may contain various types of values, and some of them are categorical. To use those categorical values efficiently in computations, we create dummy variables. A dummy variable is a binary variable that indicates whether a separate categorical variable takes on a specific value [SOURCE](https://www.geeksforgeeks.org/how-to-create-dummy-variables-in-python-with-pandas/).

### get_dummies()

**Syntax1:** ``pd.get_dummies(data, prefix=None, prefix_sep="_")``<br>
            **OR**<br>
**Syntax2:** ``df["col_name"].str.get_dummies(sep=",")``

**Parameters:**
- data: the input data, e.g. a pandas DataFrame, a Series, or another array-like object.
- prefix: string prepended to the names of the dummy columns.
- prefix_sep: separator placed between the prefix and the category value.
- Return Type: Dummy variables.
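To make the `prefix` and `prefix_sep` parameters concrete, here is a minimal sketch on a made-up frame. The frame, the prefix `dept`, and the use of `dtype=int` (to get 0/1 instead of booleans on newer pandas versions) are assumptions for illustration:

```python
import pandas as pd

df_demo = pd.DataFrame({"department": ["HR", "IT", "IT"]})

# Each category becomes a column named <prefix><prefix_sep><category>
dummies = pd.get_dummies(
    df_demo["department"], prefix="dept", prefix_sep="_", dtype=int
)

print(dummies)
```

The prefix makes it easy to tell generated indicator columns apart from the original columns after they are combined.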

![image.png](attachment:7a2b92b7-039c-49c3-b3bd-89541b017df2.png)

In [None]:
# used for categorical data whose values have no inherent order, such as car colors

In [20]:
ser = pd.Series(["p|q", "p", "p|r"])
ser

0    p|q
1      p
2    p|r
dtype: object

In [21]:
# convert this to numeric indicator columns with get_dummies
ser.str.get_dummies()
# Row 0 contains p and q, so those are 1 and r is 0; row 1 contains only p, so the others are 0.
# Why did we not have to pass "|"? Because the default separator in the docstring is sep='|'.
# Signature: ser.str.get_dummies(sep='|'). It can be changed if needed.

Unnamed: 0,p,q,r
0,1,1,0
1,1,0,0
2,1,0,1


In [89]:
df.department
# this is plain nominal data

0    HR
1    IT
2    IT
3    HR
4    IT
5    IT
6    IT
7    IT
Name: department, dtype: object

In [22]:
# each value holds a single category with no separator, so the top-level pd.get_dummies can be used without .str; see the figure above
pd.get_dummies(df.department)

Unnamed: 0,HR,IT
0,1,0
1,0,1
2,0,1
3,1,0
4,0,1
5,0,1
6,0,1
7,0,1


In [23]:
# drop_first=True drops the first dummy column, leaving a single column here: 1 means IT, 0 means not IT, i.e. HR
pd.get_dummies(df.department, drop_first=True)
# on a large dataset (say 200,000 rows) one column fewer saves time and memory
# what if the data had three departments, e.g. HR, IT and Marketing? HR would be dropped;
# a 1 in IT or Marketing means the row is not HR, and a 0 in both means it is HR

Unnamed: 0,IT
0,0
1,1
2,1
3,0
4,1
5,1
6,1
7,1


As you can see, with ``drop_first=True`` a single dummy variable (IT) is enough to encode the two categorical values of the "department" attribute; in general, k categories need only k-1 dummy columns. We can create dummy variables in Python using the **``get_dummies()``** method.

The **``drop_first=True``** parameter drops the first dummy column. It is useful because it removes the redundant column created during dummy variable creation and therefore reduces the correlation among the dummy variables. In other words, it drops the first dummy to avoid creating correlated features [SOURCE](https://stackoverflow.com/questions/63661560/drop-first-true-during-dummy-variable-creation-in-pandas#:~:text=1%20Answer,correlations%20created%20among%20dummy%20variables.).
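
To make the k-1 idea concrete, here is a minimal sketch with three departments. The `dept` Series is hypothetical data (it is not the notebook's `df`), and `dtype=int` is passed only so the output shows 0/1 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical data with three departments (not the notebook's df)
dept = pd.Series(["HR", "IT", "Marketing", "IT", "HR"], name="department")

# dtype=int keeps the indicators as 0/1 instead of booleans
full = pd.get_dummies(dept, dtype=int)
reduced = pd.get_dummies(dept, drop_first=True, dtype=int)

print(full)     # three columns: HR, IT, Marketing
print(reduced)  # two columns: IT, Marketing (HR was dropped)

# Rows where every remaining dummy is 0 belong to the dropped category (HR)
is_hr = reduced.sum(axis=1) == 0
print(is_hr.tolist())
```

No information is lost: the dropped category can always be recovered from the rows in which all remaining dummies are 0.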

In [93]:
df.Skills

0                      
1              Java,C++
2    Python,Tableau,SQL
3                      
4          React,Django
5     JavaScript,Python
6                 R,SQL
7            SQL,Python
Name: Skills, dtype: object

In [94]:
# use get_dummies to split these elements apart
df.Skills.str.get_dummies(sep=",")
# In effect, we built a DataFrame out of the unique values.
# Important: use this with categorical data, because the goal is to make it numeric. But avoid it for
# categoricals with too many unique values; as seen here, the result grows quickly, and turning 1 column into 20 makes the data harder to work with.

Unnamed: 0,C++,Django,Java,JavaScript,Python,R,React,SQL,Tableau
0,0,0,0,0,0,0,0,0,0
1,1,0,1,0,0,0,0,0,0
2,0,0,0,0,1,0,0,1,1
3,0,0,0,0,0,0,0,0,0
4,0,1,0,0,0,0,1,0,0
5,0,0,0,1,1,0,0,0,0
6,0,0,0,0,0,1,0,1,0
7,0,0,0,0,1,0,0,1,0


In [24]:
# add a prefix to the column names so the dummy columns can be told apart from the original ones
df.Skills.str.get_dummies(sep=",").add_prefix("Skills_")

Unnamed: 0,Skills_C++,Skills_Django,Skills_Java,Skills_JavaScript,Skills_Python,Skills_R,Skills_React,Skills_SQL,Skills_Tableau
0,0,0,0,0,0,0,0,0,0
1,1,0,1,0,0,0,0,0,0
2,0,0,0,0,1,0,0,1,1
3,0,0,0,0,0,0,0,0,0
4,0,1,0,0,0,0,1,0,0
5,0,0,0,1,1,0,0,0,0
6,0,0,0,0,0,1,0,1,0
7,0,0,0,0,1,0,0,1,0


**IMPORTANT**

1) If we only call pd.get_dummies(df.Skills) or df.Skills.str.get_dummies(), i.e. without setting the separator parameter to ",", the encoding works on whole cells: the values inside a cell are not split and each distinct cell becomes one column (see image).

2) In the resulting DataFrame the first column has no visible name. If we do not restrict the operation to cells that contain text by using str.get_dummies and call pd.get_dummies(df.Skills) instead, the blank cells that contain no text are also turned into a separate column.

3) For a Series like this one, the correct usage is df.Skills.str.get_dummies(sep=",").

![image.png](attachment:387bf0a8-54a9-4c6b-a78a-868a56407a06.png)
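
The three cases above can be compared side by side. This is a minimal sketch on a small hypothetical `skills` Series, assuming the blank cells are empty strings; it is not the notebook's full `df.Skills` column.

```python
import pandas as pd

# Hypothetical stand-in for df.Skills; the first cell is assumed to be an empty string
skills = pd.Series(["", "Java,C++", "SQL,Python"], name="Skills")

# 1) Whole-cell encoding: every distinct cell value becomes one column,
#    including the empty string
whole_cell = pd.get_dummies(skills, dtype=int)

# 2) The default separator is "|", so comma-separated cells are still not split
default_sep = skills.str.get_dummies()

# 3) Splitting on "," yields one column per individual skill
per_skill = skills.str.get_dummies(sep=",")

print(list(whole_cell.columns))
print(list(default_sep.columns))
print(list(per_skill.columns))
```

Only the third call produces one indicator per skill; the row with the empty cell is all zeros there.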

In [25]:
Skills_dummy = df.Skills.str.get_dummies(sep=",").add_prefix("Skills_")

In [26]:
df
# Let's drop the columns we will not use, because the frame will grow a lot once it is joined with Skills_dummy.
# Columns such as ids and first/last names add little to a model, so we leave them out.

Unnamed: 0,id,staff,department,job,salary,age,first_name,last_name,skills,Skills
0,M0001,Tom BLUE,HR,manager,"""$150,000""",52,Tom,Blue,[],
1,M0002,JOHN BLACK,IT,manager,"""$180,000""",48,John,Black,"[Java, C++]","Java,C++"
2,E0001,Micheal Brown,IT,data scientist,"""$150,000""",35,Micheal,Brown,"[Python, Tableau, SQL]","Python,Tableau,SQL"
3,E0002,jason walker,HR,recruiter,130000dolar,38,Jason,Walker,[],
4,E0003,Alex Green,IT,backend developer,"""$110,000""",-,Alex,Green,"[React, Django]","React,Django"
5,E0004,OSCAR SMİTH,IT,frontend developer,"""$120,000""",32,Oscar,Smi̇th,"[JavaScript, Python]","JavaScript,Python"
6,E0005,Adrian STAR,IT,data scientist,"""$135,000""",40,Adrian,Star,"[R, SQL]","R,SQL"
7,E0006,Albert simon,IT,data scientist,125000dolar,35,Albert,Simon,"[SQL, Python]","SQL,Python"


In [27]:
df_final = df[["department", "job", "salary", "skills"]]

In [28]:
df_final

Unnamed: 0,department,job,salary,skills
0,HR,manager,"""$150,000""",[]
1,IT,manager,"""$180,000""","[Java, C++]"
2,IT,data scientist,"""$150,000""","[Python, Tableau, SQL]"
3,HR,recruiter,130000dolar,[]
4,IT,backend developer,"""$110,000""","[React, Django]"
5,IT,frontend developer,"""$120,000""","[JavaScript, Python]"
6,IT,data scientist,"""$135,000""","[R, SQL]"
7,IT,data scientist,125000dolar,"[SQL, Python]"


In [29]:
# now let's add the dummy columns to it
df_final = df_final.join(Skills_dummy)

In [30]:
df_final

Unnamed: 0,department,job,salary,skills,Skills_C++,Skills_Django,Skills_Java,Skills_JavaScript,Skills_Python,Skills_R,Skills_React,Skills_SQL,Skills_Tableau
0,HR,manager,"""$150,000""",[],0,0,0,0,0,0,0,0,0
1,IT,manager,"""$180,000""","[Java, C++]",1,0,1,0,0,0,0,0,0
2,IT,data scientist,"""$150,000""","[Python, Tableau, SQL]",0,0,0,0,1,0,0,1,1
3,HR,recruiter,130000dolar,[],0,0,0,0,0,0,0,0,0
4,IT,backend developer,"""$110,000""","[React, Django]",0,1,0,0,0,0,1,0,0
5,IT,frontend developer,"""$120,000""","[JavaScript, Python]",0,0,0,1,1,0,0,0,0
6,IT,data scientist,"""$135,000""","[R, SQL]",0,0,0,0,0,1,0,1,0
7,IT,data scientist,125000dolar,"[SQL, Python]",0,0,0,0,1,0,0,1,0


In [31]:
df_final.drop("skills", axis=1, inplace=True)

In [109]:
df_final

Unnamed: 0,department,job,salary,Skills_C++,Skills_Django,Skills_Java,Skills_JavaScript,Skills_Python,Skills_R,Skills_React,Skills_SQL,Skills_Tableau
0,HR,manager,150000,0,0,0,0,0,0,0,0,0
1,IT,manager,180000,1,0,1,0,0,0,0,0,0
2,IT,data scientist,150000,0,0,0,0,1,0,0,1,1
3,HR,recruiter,130000,0,0,0,0,0,0,0,0,0
4,IT,backend developer,110000,0,1,0,0,0,0,1,0,0
5,IT,frontend developer,120000,0,0,0,1,1,0,0,0,0
6,IT,data scientist,135000,0,0,0,0,0,1,0,1,0
7,IT,data scientist,125000,0,0,0,0,1,0,0,1,0


In [None]:
# To convert the remaining categorical columns, pass the whole DataFrame to pd.get_dummies; it handles all of them.

In [32]:
pd.get_dummies(df_final, drop_first=True)
# Looking at the new columns, the prefixes were added automatically; it already does what we did manually.

Unnamed: 0,Skills_C++,Skills_Django,Skills_Java,Skills_JavaScript,Skills_Python,Skills_R,Skills_React,Skills_SQL,Skills_Tableau,department_IT,job_data scientist,job_frontend developer,job_manager,job_recruiter,"salary_""$120,000""","salary_""$135,000""","salary_""$150,000""","salary_""$180,000""",salary_125000dolar,salary_130000dolar
0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0
1,1,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0
2,0,0,0,0,1,0,0,1,1,1,1,0,0,0,0,0,1,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1
4,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0
5,0,0,0,1,1,0,0,0,0,1,0,1,0,0,1,0,0,0,0,0
6,0,0,0,0,0,1,0,1,0,1,1,0,0,0,0,1,0,0,0,0
7,0,0,0,0,1,0,0,1,0,1,1,0,0,0,0,0,0,0,1,0
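
In the output above every distinct salary string became its own column, because `salary` is still text. One possible cleanup, sketched here under the assumption that the salaries only need their non-digit characters removed, is to convert the column to integers first; numeric columns are passed through `pd.get_dummies` unchanged. The `salary` and `clean_df` objects below are hypothetical samples, not the notebook's data.

```python
import pandas as pd

# Hypothetical sample mimicking the two salary formats seen above
salary = pd.Series(["$150,000", "130000dolar", "$110,000"], name="salary")

# Remove every non-digit character, then cast to int
salary_clean = salary.str.replace(r"\D", "", regex=True).astype(int)
print(salary_clean.tolist())

clean_df = pd.DataFrame({"department": ["HR", "HR", "IT"], "salary": salary_clean})

# Only the text column is encoded; the numeric salary column is kept as-is
encoded = pd.get_dummies(clean_df, dtype=int)
print(encoded)
```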


In [None]:
# Since get_dummies encodes every non-numeric column, the DataFrame's columns must first be brought into a
# suitable state. The example above was already prepared, so cleaning, missing-value handling, etc. come first.

# We may also not want get_dummies applied to some columns. An id column, for example, would produce
# a separate column for every unique id; such columns should be dropped.

# Every feature matters differently to the model, so sometimes only columns selected by feature importance
# should be passed to get_dummies. In short, get_dummies calls for careful and detailed preparation.

# another example

In [33]:
df = pd.DataFrame([('Foreign Cinema', 289.0),
                   ('Liho Liho', 224.0),
                   ('500 Club', 80.5),
                   ('Foreign Cinema', 25.30)],
           columns=('name', 'Amount')
                 )

df

Unnamed: 0,name,Amount
0,Foreign Cinema,289.0
1,Liho Liho,224.0
2,500 Club,80.5
3,Foreign Cinema,25.3


**1. Creating Dummy Indicator columns**

To create dummy columns, I need to tell pandas which DataFrame I want to use, and which columns I want to create dummies on. Here I want to create dummies on the 'name' column.

In [34]:
pd.get_dummies(df, columns= ["name"])
# note that the column name is passed inside a list [ ]

Unnamed: 0,Amount,name_500 Club,name_Foreign Cinema,name_Liho Liho
0,289.0,0,1,0
1,224.0,0,0,1
2,80.5,1,0,0
3,25.3,0,1,0


Notice how there are 3 new columns, one for every distinct value within our old 'name' column. Each new column holds 1s and 0s showing whether that row had the column's value.

**2. Creating Dummy Indicator columns with prefix**

See how above all of my new columns start with "name_"? Well I don't like it. I want to switch the prefix to something else. You can do this by specifying "prefix" parameter.

In [35]:
pd.get_dummies(df, columns=["name"], prefix= "dmy")

Unnamed: 0,Amount,dmy_500 Club,dmy_Foreign Cinema,dmy_Liho Liho
0,289.0,0,1,0
1,224.0,0,0,1
2,80.5,1,0,0
3,25.3,0,1,0


In [151]:
# The _ sits between my prefix and the column names. I'll switch it to a * by specifying prefix_sep.

pd.get_dummies(df, columns=["name"], prefix = "dmy", prefix_sep = "*")

Unnamed: 0,Amount,dmy*500 Club,dmy*Foreign Cinema,dmy*Liho Liho
0,289.0,0,1,0
1,224.0,0,0,1
2,80.5,1,0,0
3,25.3,0,1,0


**3. Creating Dummy Indicator columns and dropping the first variable**

Notice above how every row has exactly one "1" across the new dummy columns? This is because every value is accounted for with a True (1) indicator. However, what if a row was all 0s? That is also a way to identify one of your values. drop_first allows you to drop the first dummy column and identify its value through all other columns being 0.

Notice in the output below how the "500 Club" column has been removed, and the row where "500 Club" used to be now has 0s in both "Foreign Cinema" and "Liho Liho".

In [36]:
pd.get_dummies(df, columns=['name'], drop_first=True)

Unnamed: 0,Amount,name_Foreign Cinema,name_Liho Liho
0,289.0,1,0
1,224.0,0,1
2,80.5,0,0
3,25.3,1,0


## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Working with Time Data</p>

<a id="2"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

The pandas Python package is extremely useful for time series manipulation and analysis. Typical topics include how to:
- create a date range
- work with timestamp data
- convert string data to a timestamp
- index and slice your time series data in a data frame
- resample your time series for different time period aggregates/summary statistics
- compute a rolling statistic such as a rolling average
- work with missing data
- understand the basics of unix/epoch time
- understand common pitfalls of time series data analysis [SOURCE](https://towardsdatascience.com/basic-time-series-manipulation-with-pandas-4432afee64ea)

This short section is by no means a complete guide to the time series tools available in Python or Pandas, but instead is intended as a broad overview of how you as a user should approach working with time series [SOURCE](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html).
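
A few of the items listed above (creating a date range, slicing by date, resampling, and a rolling average) can be sketched in a handful of lines. The `sales` Series is made-up data used only for illustration.

```python
import pandas as pd

# Create a daily date range and a hypothetical series of daily sales
idx = pd.date_range("2021-01-01", periods=14, freq="D")
sales = pd.Series(range(1, 15), index=idx)

# Slice by date labels (both ends are included)
first_week = sales.loc["2021-01-01":"2021-01-07"]

# Resample into 7-day bins and sum each bin
weekly = sales.resample("7D").sum()

# 3-day rolling average (the first two values are NaN)
rolling = sales.rolling(window=3).mean()

print(weekly)
print(rolling.head())
```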

In [37]:
import pandas as pd
df = pd.read_csv("time_exercise.csv")
df.head()

Unnamed: 0,id_product,order_date,product_quantity,product_price,entry_date
0,401,2021-01-23,1.0,541.487603,2018-12-04
1,416,2020-04-02,1.0,131.181818,2018-12-04
2,717,2019-03-10,1.0,2035.4885,2018-12-04
3,778,2019-12-27,1.0,335.988,2018-12-04
4,826,2020-02-19,1.0,342.292302,2018-12-04


In [38]:
df.info()
# There are two date columns, but they show up as object dtype. Only after converting them to datetime can we apply datetime methods.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 911 entries, 0 to 910
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   id_product        911 non-null    int64  
 1   order_date        911 non-null    object 
 2   product_quantity  911 non-null    float64
 3   product_price     911 non-null    float64
 4   entry_date        911 non-null    object 
dtypes: float64(2), int64(1), object(2)
memory usage: 35.7+ KB


In [113]:
df["order_date"]

0      2021-01-23
1      2020-04-02
2      2019-03-10
3      2019-12-27
4      2020-02-19
          ...    
906    2020-11-24
907    2020-11-24
908    2020-11-22
909    2021-01-26
910    2020-12-06
Name: order_date, Length: 911, dtype: object

### <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:LEFT; border-radius:10px 10px;">pd.to_datetime()</p>

<a id="2.1"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

For more detailed information about the to_datetime() method, please [Visit Official Document](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html)

**``pd.to_datetime()``** Converts argument to datetime.

This function converts a **``scalar``**, **``array-like``**, **``Series``** or **``DataFrame/dict-like``** to a pandas datetime object.

**As stated above, many input types are supported, and lead to different output types:**

- **``scalars``** can be int, float, str, datetime object (from stdlib datetime module or numpy). They are converted to Timestamp when possible, otherwise they are converted to datetime.datetime. None/NaN/null scalars are converted to NaT.

- **``array-like``** can contain int, float, str, datetime objects. They are converted to DatetimeIndex when possible, otherwise they are converted to Index with object dtype, containing datetime.datetime. None/NaN/null entries are converted to NaT in both cases.

- **``Series``** are converted to Series with datetime64 dtype when possible, otherwise they are converted to Series with object dtype, containing datetime.datetime. None/NaN/null entries are converted to NaT in both cases.

- **``DataFrame/dict-like``** are converted to Series with datetime64 dtype. For each row a datetime is created from assembling the various dataframe columns. Column keys can be common abbreviations (like [‘year’, ‘month’, ‘day’, ‘minute’, ‘second’, ‘ms’, ‘us’, ‘ns’]) or plurals of the same.
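
The four input kinds described above can be checked quickly. This is a small sketch with made-up values; the variable names are only illustrative.

```python
import pandas as pd

# scalar -> Timestamp
scalar = pd.to_datetime("2021-01-23")

# array-like -> DatetimeIndex
index = pd.to_datetime(["2021-01-23", "2020-04-02"])

# Series -> Series of datetimes; None becomes NaT
series = pd.to_datetime(pd.Series(["2021-01-23", None]))

# DataFrame -> one datetime per row, assembled from year/month/day columns
parts = pd.DataFrame({"year": [2021, 2020], "month": [1, 4], "day": [23, 2]})
assembled = pd.to_datetime(parts)

print(type(scalar).__name__, type(index).__name__)
print(series)
print(assembled)
```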

[Special Note :](https://pandas.pydata.org/docs/getting_started/intro_tutorials/09_timeseries.html)

As many data sets contain datetime information in one of their columns, pandas input functions like [pandas.read_csv()](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv) and [pandas.read_json()](https://pandas.pydata.org/docs/reference/api/pandas.read_json.html#pandas.read_json) can do the transformation to dates when reading the data, using the **``parse_dates``** parameter with a list of the columns to read as Timestamp.

![image.png](attachment:9624f01b-ea8c-4197-bf16-09a735de4550.png)
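
As a minimal sketch of `parse_dates`, the snippet below reads a tiny in-memory CSV instead of a file. The CSV text is a made-up sample that only borrows the column names of this dataset.

```python
import io
import pandas as pd

# A small in-memory CSV standing in for a real file
csv_text = """id_product,order_date,entry_date
401,2021-01-23,2018-12-04
416,2020-04-02,2018-12-04
"""

# parse_dates converts the listed columns while reading
demo = pd.read_csv(io.StringIO(csv_text), parse_dates=["order_date", "entry_date"])
print(demo.dtypes)

# The .dt accessor is available right away, without a separate to_datetime call
print(demo["order_date"].dt.year.tolist())
```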

Why are these **``pandas.Timestamp``** objects useful? Let's illustrate the added value with some example cases. Let us assume that we want to work with the dates in the columns order_date and entry_date as datetime objects instead of plain text:

In [39]:
df["order_date"] = pd.to_datetime(df["order_date"])
df["entry_date"] = pd.to_datetime(df["entry_date"])

In [132]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 911 entries, 0 to 910
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   id_product        911 non-null    int64         
 1   order_date        911 non-null    datetime64[ns]
 2   product_quantity  911 non-null    float64       
 3   product_price     911 non-null    float64       
 4   entry_date        911 non-null    datetime64[ns]
dtypes: datetime64[ns](2), float64(2), int64(1)
memory usage: 35.7 KB


In [40]:
# extract only the years
df["entry_date"].dt.year
# just as strings have .str, datetimes have the .dt accessor

0      2018
1      2018
2      2018
3      2018
4      2018
       ... 
906    2020
907    2020
908    2020
909    2020
910    2020
Name: entry_date, Length: 911, dtype: int64

In [5]:
# extract only the months
df["entry_date"].dt.month

0      12
1      12
2      12
3      12
4      12
       ..
906    10
907    10
908    11
909    11
910    11
Name: entry_date, Length: 911, dtype: int64

In [6]:
df["entry_date"].dt.day
# other attributes such as weekday work the same way (for week numbers, newer pandas uses dt.isocalendar().week)

0       4
1       4
2       4
3       4
4       4
       ..
906     7
907     7
908    13
909    24
910    26
Name: entry_date, Length: 911, dtype: int64

In [7]:
df["entry_date"].dt.date

0      2018-12-04
1      2018-12-04
2      2018-12-04
3      2018-12-04
4      2018-12-04
          ...    
906    2020-10-07
907    2020-10-07
908    2020-11-13
909    2020-11-24
910    2020-11-26
Name: entry_date, Length: 911, dtype: object

In [5]:
df.order_date.dt.day_name()
# df.order_date.dt.month_name()

0       Saturday
1       Thursday
2         Sunday
3         Friday
4      Wednesday
         ...    
906      Tuesday
907      Tuesday
908       Sunday
909      Tuesday
910       Sunday
Name: order_date, Length: 911, dtype: object

Initially, the values in order_date and entry_date are character strings and do **NOT** provide any datetime operations (e.g. extracting the year or the day of the week). By applying the to_datetime function, pandas interprets the strings and converts them to datetime (i.e. ``datetime64[ns]``) objects. In pandas, these datetime objects, which are similar to datetime.datetime from the standard library, are called [pandas.Timestamp](https://pandas.pydata.org/docs/reference/api/pandas.Timestamp.html#pandas.Timestamp).

In [None]:
# pd.to_datetime is a pandas function. There is also a module called datetime, which is entirely separate from pandas.

Now let's apply some aggregate methods to the datetime columns of the given dataset:

In [133]:
df.entry_date.min()

Timestamp('2018-12-04 00:00:00')

In [134]:
df.entry_date.max()

Timestamp('2020-11-26 00:00:00')

In [135]:
# arithmetic also works on these values
df.entry_date.max() - df.entry_date.min()

Timedelta('723 days 00:00:00')

### <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:LEFT; border-radius:10px 10px;">Series.dt()</p>

<a id="2.2"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

``Series.dt`` is an accessor object for the datetimelike properties of the Series values [SOURCE](https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.html). It is used as an attribute, without parentheses (e.g. ``ser.dt.year``).

For comprehensive information about the datetimelike properties, please visit the [Official Pandas API Reference Document](https://pandas.pydata.org/pandas-docs/version/0.22/api.html#datetimelike-properties)

In [41]:
a = pd.Series(["15-03-2020", "18-05-2019", "24-07-2018"])
a

0    15-03-2020
1    18-05-2019
2    24-07-2018
dtype: object

In [121]:
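# the values are still strings, so max/min compare them character by character, not chronologically
a.max(), a.min()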

('24-07-2018', '15-03-2020')

In [42]:
a = pd.to_datetime(a, format = "%d-%m-%Y")
a

0   2020-03-15
1   2019-05-18
2   2018-07-24
dtype: datetime64[ns]

In [123]:
a.max()

Timestamp('2020-03-15 00:00:00')

In [124]:
a.min()

Timestamp('2018-07-24 00:00:00')
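
When some strings do not match the given format, `pd.to_datetime` raises an error by default. A common option is `errors="coerce"`, which turns unparseable entries into NaT. The sketch below uses a made-up Series with one deliberately bad value.

```python
import pandas as pd

# Hypothetical input with one value that is not a date
raw = pd.Series(["15-03-2020", "18-05-2019", "not a date"])

# Day-first strings parsed with an explicit format; bad values become NaT
parsed = pd.to_datetime(raw, format="%d-%m-%Y", errors="coerce")
print(parsed)

# max/min are now chronological and skip NaT
print(parsed.max(), parsed.min())
```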

### <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:LEFT; border-radius:10px 10px;">Datetime Module</p>

<a id="2.3"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

The datetime module supplies classes for manipulating dates and times [SOURCE](https://docs.python.org/3/library/datetime.html).

### ``class datetime.datetime``

A combination of a date and a time. Attributes: year, month, day, hour, minute, second, microsecond, and tzinfo.

In [43]:
from datetime import datetime
# to use the datetime module we must import it first
# the datetime module has 6 classes; datetime and timedelta are the two we need

In [44]:
datetime.now()

datetime.datetime(2023, 2, 15, 22, 53, 2, 830232)

In [45]:
print(datetime.now())
# print displays it in a more readable form

2023-02-15 22:53:09.233195


In [11]:
type(datetime.now())

datetime.datetime

In [47]:
# assign it to a variable to pin down the moment this cell was run
current_datetime = datetime.now()


In [15]:
print(current_datetime)

2022-12-17 09:23:45.682177


In [48]:
# the variable is a datetime, so datetime attributes and methods can be applied
current_datetime.year
# no .dt here anymore, since this is not a pandas Series

2023

In [49]:
current_datetime.month

2

In [50]:
current_datetime.weekday()
# weekday is a method, so it is called with ()

2

In [20]:
current_datetime.isoweekday()

6

In [21]:
current_datetime.minute

23

In [None]:
# To apply datetime-module functions to a pandas Series, we either loop over the Series
# or use apply. It is like the difference between plain string functions and the pandas .str methods.
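
The contrast between the element-wise route and the vectorized `.dt` accessor can be sketched as follows. The two dates in `s` are sample values chosen for illustration.

```python
import pandas as pd

s = pd.to_datetime(pd.Series(["2022-12-17", "2023-02-15"]))

# Element-wise: call the method on each Timestamp through apply
via_apply = s.apply(lambda ts: ts.isoweekday())

# Vectorized: .dt.weekday counts Monday as 0, so add 1 to match isoweekday
via_dt = s.dt.weekday + 1

print(via_apply.tolist())
print(via_dt.tolist())
```

Both give the same numbers; the `.dt` version avoids a Python-level call per element.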

### ``class datetime.timedelta``

A duration expressing the difference between two date, time, or datetime instances to microsecond resolution [SOURCE](https://www.geeksforgeeks.org/manipulate-date-and-time-with-the-datetime-module-in-python/).

In [51]:
from datetime import timedelta
# import the timedelta class

In [52]:
timedelta(days=2)
# the days parameter defines a time span; weeks, hours, and seconds can be given as well

datetime.timedelta(days=2)

In [53]:
current_datetime

datetime.datetime(2023, 2, 15, 22, 53, 38, 419142)

In [54]:
two_days_before = current_datetime - timedelta(days=2)
two_days_before
# Suppose we want to see the state of all the data one month ago; this is how we would do it. It is also
# useful for comparing things like sales differences across time periods.

datetime.datetime(2023, 2, 13, 22, 53, 38, 419142)

In [28]:
datetime.now() + timedelta(weeks=3, days = 4, hours = 6, minutes = 20)

datetime.datetime(2023, 1, 11, 15, 52, 32, 880236)

In [29]:
print(datetime.now() + timedelta(weeks=3, days = 4, hours = 6, minutes = 20))

2023-01-11 15:52:44.083513


In [31]:
print(f"{'current_date' : <15}", datetime.now())

print(f"{'plus' : <15}", timedelta(weeks=3, days = 4, hours = 6, minutes = 20))

print(f"{'total' : <15}", datetime.now() + timedelta(weeks=3, days = 4, hours = 6, minutes = 20))

current_date    2022-12-17 09:36:34.476146
plus            25 days, 6:20:00
total           2023-01-11 15:56:34.476592
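
A `timedelta` normalizes whatever units it is given into days, seconds, and microseconds. The sketch below inspects the same span used above; the fixed `start` value is chosen here so the result is reproducible, unlike `datetime.now()`.

```python
from datetime import datetime, timedelta

span = timedelta(weeks=3, days=4, hours=6, minutes=20)

print(span)                  # 25 days, 6:20:00
print(span.days)             # 25
print(span.seconds)          # 22800 (the 6 h 20 min remainder)
print(span.total_seconds())  # 2182800.0

# A fixed starting point instead of datetime.now()
start = datetime(2022, 12, 17, 9, 36, 34)
print(start + span)          # 2023-01-11 15:56:34
```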


### ``strftime()``

**Converting** from date/time/datetime object **to string type** [SOURCE](https://strftime.org/)

In [32]:
print(current_datetime)

2022-12-17 09:23:45.682177


In [33]:
current_datetime.year

2022

In [34]:
type(current_datetime.year)
# IMPORTANT: the type is int. To get it as a string, use strftime

int

In [35]:
current_datetime.strftime("%Y")

'2022'

**Note the difference.**

In [36]:
type(current_datetime.strftime("%Y"))
# the type is now str

str

In [37]:
print("current_datetime :", current_datetime)

year = current_datetime.strftime("%Y")
print("year:", year)

month = current_datetime.strftime("%m")
print("month:", month)

day = current_datetime.strftime("%d")
print("day:", day)

time = current_datetime.strftime("%H:%M:%S")
print("time:", time)

date_time = current_datetime.strftime("%m/%d/%Y, %H:%M:%S")
print("date and time:", date_time)

current_datetime : 2022-12-17 09:23:45.682177
year: 2022
month: 12
day: 17
time: 09:23:45
date and time: 12/17/2022, 09:23:45
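
The same format codes work on a whole pandas column through `Series.dt.strftime`. A minimal sketch with two sample dates:

```python
import pandas as pd

s = pd.to_datetime(pd.Series(["2022-12-17", "2018-06-21"]))

# Format every element at once; the result holds strings, not datetimes
labels = s.dt.strftime("%m/%d/%Y")
print(labels.tolist())
```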


### strptime()

**Converting** from string type **to datetime object**

![image.png](attachment:1d7bfef8-0dbd-489c-b056-bfe203a790da.png)

In [38]:
# strptime converts a date given as a string, in various formats/patterns, into a datetime

date_string = "21 June, 2018"
date_string

'21 June, 2018'

In [39]:
# Convert this string date to a datetime. The key is to specify a format that matches how the date was written.
datetime.strptime(date_string, "%d %B, %Y")
# we used %B and the literal comma so that the pattern matches the given string exactly

datetime.datetime(2018, 6, 21, 0, 0)

In [40]:
type(datetime.strptime(date_string, "%d %B, %Y"))

datetime.datetime

In [41]:
# this is how we did the conversion in pandas:
pd.to_datetime(date_string)

Timestamp('2018-06-21 00:00:00')
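
Unlike `pd.to_datetime`, which can often infer the layout, `strptime` is strict: the string must match the format exactly, otherwise a `ValueError` is raised. A small sketch, assuming an English locale for the month name matched by `%B`:

```python
from datetime import datetime

fmt = "%d %B, %Y"

# Matches the pattern exactly
parsed = datetime.strptime("21 June, 2018", fmt)
print(parsed)

# A string in a different layout does not match and raises ValueError
try:
    datetime.strptime("21/06/2018", fmt)
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)

print(mismatch_error)
```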

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Operation with Datetime Object</p>

<a id="3"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

## Let's detect the time between first order date and entry date for each product

In [42]:
df.head()

Unnamed: 0,id_product,order_date,product_quantity,product_price,entry_date
0,401,2021-01-23,1.0,541.487603,2018-12-04
1,416,2020-04-02,1.0,131.181818,2018-12-04
2,717,2019-03-10,1.0,2035.4885,2018-12-04
3,778,2019-12-27,1.0,335.988,2018-12-04
4,826,2020-02-19,1.0,342.292302,2018-12-04


In [43]:
# task: find the time between a product's entry into the system and its first sale

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 911 entries, 0 to 910
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   id_product        911 non-null    int64         
 1   order_date        911 non-null    datetime64[ns]
 2   product_quantity  911 non-null    float64       
 3   product_price     911 non-null    float64       
 4   entry_date        911 non-null    datetime64[ns]
dtypes: datetime64[ns](2), float64(2), int64(1)
memory usage: 35.7 KB


**Let us do it by string methods**

In [56]:
# we could also have handled this simply, like so
df.order_date - df.entry_date

0     781 days
1     485 days
2      96 days
3     388 days
4     442 days
        ...   
906    48 days
907    48 days
908     9 days
909    63 days
910    10 days
Length: 911, dtype: timedelta64[ns]

In [44]:
df.entry_date = pd.to_datetime(df.entry_date)
df.order_date = pd.to_datetime(df.order_date)
# first, the object columns are converted to datetime

In [45]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 911 entries, 0 to 910
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   id_product        911 non-null    int64         
 1   order_date        911 non-null    datetime64[ns]
 2   product_quantity  911 non-null    float64       
 3   product_price     911 non-null    float64       
 4   entry_date        911 non-null    datetime64[ns]
dtypes: datetime64[ns](2), float64(2), int64(1)
memory usage: 35.7 KB


In [46]:
df.order_date - df.entry_date

0     781 days
1     485 days
2      96 days
3     388 days
4     442 days
        ...   
906    48 days
907    48 days
908     9 days
909    63 days
910    10 days
Length: 911, dtype: timedelta64[ns]

In [47]:
# the result above is the gap between entry date and order date for every sale, i.e. every order
df["time_delta"] = df.order_date - df.entry_date

In [48]:
df.head(3)

Unnamed: 0,id_product,order_date,product_quantity,product_price,entry_date,time_delta
0,401,2021-01-23,1.0,541.487603,2018-12-04,781 days
1,416,2020-04-02,1.0,131.181818,2018-12-04,485 days
2,717,2019-03-10,1.0,2035.4885,2018-12-04,96 days


In [51]:
# let's get rid of the "days" unit
df["time_delta"].astype("str").str.split(" ").str[0].astype(int)
# finally, we cast the result to int

0      781
1      485
2       96
3      388
4      442
      ... 
906     48
907     48
908      9
909     63
910     10
Name: time_delta, Length: 911, dtype: int64

In [52]:
df.time_delta = df["time_delta"].astype("str").str.split(" ").str[0].astype(int)

In [53]:
df.head(3)

Unnamed: 0,id_product,order_date,product_quantity,product_price,entry_date,time_delta
0,401,2021-01-23,1.0,541.487603,2018-12-04,781
1,416,2020-04-02,1.0,131.181818,2018-12-04,485
2,717,2019-03-10,1.0,2035.4885,2018-12-04,96
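
The string-based conversion above works, but a timedelta Series also exposes the number of whole days directly through `.dt.days`. The sketch below compares both routes on a small hypothetical frame built from the first three rows shown above.

```python
import pandas as pd

# Hypothetical mini-frame using the first three rows of the dataset
demo = pd.DataFrame({
    "order_date": pd.to_datetime(["2021-01-23", "2020-04-02", "2019-03-10"]),
    "entry_date": pd.to_datetime(["2018-12-04"] * 3),
})
delta = demo["order_date"] - demo["entry_date"]

# Route 1: string methods, as in the cells above
days_via_str = delta.astype("str").str.split(" ").str[0].astype(int)

# Route 2: the .dt accessor on a timedelta Series
days_via_dt = delta.dt.days

print(days_via_str.tolist())
print(days_via_dt.tolist())
```

The `.dt.days` route skips the round trip through strings and does not depend on how pandas prints a timedelta.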


In [57]:
df.groupby("id_product")["time_delta"].min()

id_product
401        781
416        485
717         96
778        388
826        442
          ... 
1536841     38
1536842     48
1536887      9
1536952     63
1536974     10
Name: time_delta, Length: 498, dtype: int64

In [59]:
# these values are per product (the same id can have several sales); let's broadcast them to every row
df.groupby("id_product")["time_delta"].transform("min")

0      781
1      485
2       96
3      388
4      442
      ... 
906     48
907     48
908      9
909     63
910     10
Name: time_delta, Length: 911, dtype: int64

In [61]:
# let's assign this to a column
df["passing_time_to_firstsale"] = df.groupby("id_product")["time_delta"].transform("min")

In [65]:
# fetch the rows with product id 3669
df[df["id_product"] == 3669]
# the last column holds the minimum value for this group

Unnamed: 0,id_product,order_date,product_quantity,product_price,entry_date,time_delta,passing_time_to_firstsale
80,3669,2019-12-11,1.0,227.988,2018-12-04,372,372
81,3669,2020-10-18,1.0,227.988,2018-12-04,684,372
82,3669,2020-11-18,2.0,227.988,2018-12-04,715,372
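
The difference between aggregating and transforming is the shape of the result: an aggregation returns one value per group, while `transform` returns one value per original row. A minimal sketch on a hypothetical three-row frame:

```python
import pandas as pd

# Hypothetical data: product 3669 has two sales, product 401 has one
demo = pd.DataFrame({
    "id_product": [3669, 3669, 401],
    "time_delta": [372, 684, 781],
})

# One value per group (indexed by id_product)
per_group = demo.groupby("id_product")["time_delta"].min()

# One value per row, aligned with the original index
per_row = demo.groupby("id_product")["time_delta"].transform("min")

print(per_group)
print(per_row.tolist())
```

Because the transformed result keeps the original index, it can be assigned straight back as a new column.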


## Let's detect the time between last order date and today for each product

In [67]:
df.groupby("id_product").order_date.max()
# the last order date of each product

id_product
401       2021-01-23
416       2020-04-02
717       2019-03-10
778       2019-12-27
826       2020-02-19
             ...    
1536841   2020-11-22
1536842   2020-11-24
1536887   2020-11-22
1536952   2021-01-26
1536974   2020-12-06
Name: order_date, Length: 498, dtype: datetime64[ns]

In [68]:
df.groupby("id_product").order_date.transform("max")
# for every row, the last order date of that row's product

0     2021-01-23
1     2020-04-02
2     2019-03-10
3     2019-12-27
4     2020-02-19
         ...    
906   2020-11-24
907   2020-11-24
908   2020-11-22
909   2021-01-26
910   2020-12-06
Name: order_date, Length: 911, dtype: datetime64[ns]

In [69]:
# not needed here, but if the values above carried a time component (seconds etc.), this keeps only the date
df.groupby("id_product").order_date.transform("max").dt.date

0      2021-01-23
1      2020-04-02
2      2019-03-10
3      2019-12-27
4      2020-02-19
          ...    
906    2020-11-24
907    2020-11-24
908    2020-11-22
909    2021-01-26
910    2020-12-06
Name: order_date, Length: 911, dtype: object

In [70]:
last_order_date = df.groupby("id_product").order_date.transform("max").dt.date

In [76]:
today = pd.to_datetime("17-12-2022", format = "%d-%m-%Y")
print(today)

2022-12-17 00:00:00


In [77]:
today = today.date()
# today is a single Timestamp created by pd.to_datetime (not a Series), so .date() is called directly, without .dt

In [78]:
print(today)

2022-12-17


In [79]:
df["passing_time_from_lastsale"] = today - last_order_date

In [80]:
df.head(3)

Unnamed: 0,id_product,order_date,product_quantity,product_price,entry_date,time_delta,passing_time_to_firstsale,passing_time_from_lastsale
0,401,2021-01-23,1.0,541.487603,2018-12-04,781,781,693 days
1,416,2020-04-02,1.0,131.181818,2018-12-04,485,485,989 days
2,717,2019-03-10,1.0,2035.4885,2018-12-04,96,96,1378 days


In [81]:
df["passing_time_from_lastsale"] = df["passing_time_from_lastsale"].astype("str").str.split(" ").str[0].astype(int)

In [82]:
df.passing_time_from_lastsale

0       693
1       989
2      1378
3      1086
4      1032
       ... 
906     753
907     753
908     755
909     690
910     741
Name: passing_time_from_lastsale, Length: 911, dtype: int64
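
The same result can be reached without converting to `date` objects or strings by keeping everything as Timestamps and reading the day count with `.dt.days`. This is a sketch on a hypothetical three-row frame; the fixed `today` value mirrors the date used above.

```python
import pandas as pd

# Hypothetical orders: product 3669 was sold twice, product 401 once
orders = pd.DataFrame({
    "id_product": [401, 3669, 3669],
    "order_date": pd.to_datetime(["2021-01-23", "2019-12-11", "2020-11-18"]),
})

today = pd.Timestamp("2022-12-17")

# Last order date of each row's product, kept as datetime64
last_order = orders.groupby("id_product")["order_date"].transform("max")

# Timestamp minus datetime Series gives a timedelta Series; .dt.days extracts whole days
days_since_last_sale = (today - last_order).dt.days
print(days_since_last_sale.tolist())
```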