In [2]:
import numpy as np
np.random.seed(12345)
np.set_printoptions(precision=4, suppress=True)

## Dates and times (Lec)
The built-in Python datetime module provides datetime, date, and time types. The datetime type combines the information stored in date and time and is the most commonly used:

In [4]:
from datetime import datetime, date, time
dt = datetime(2011, 10, 29, 20, 30, 21)
dt

datetime.datetime(2011, 10, 29, 20, 30, 21)

In [None]:
print(dt)

2011-10-29 20:30:21


In [5]:

dt.day

29

In [6]:
dt.minute

30

In [7]:
dt

datetime.datetime(2011, 10, 29, 20, 30, 21)

In [8]:
dt.date()

datetime.date(2011, 10, 29)

In [9]:
dt.time()

datetime.time(20, 30, 21)

In [None]:
print(dt.date())
print(dt.time())

2011-10-29
20:30:21
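Going the other way, `datetime.combine` reassembles a datetime from separate date and time objects. This is a minimal sketch; the names `d`, `t`, and `rebuilt` are illustrative and do not appear elsewhere in these notes:

```python
from datetime import datetime, date, time

d = date(2011, 10, 29)
t = time(20, 30, 21)

# combine() joins a date and a time into a single datetime
rebuilt = datetime.combine(d, t)
print(rebuilt)  # 2011-10-29 20:30:21

# Splitting the result apart again yields the original pieces
print(rebuilt.date() == d, rebuilt.time() == t)  # True True
```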


The `strftime` method formats a datetime as a string:

| Type | Description |
|:-----|:------------|
| %Y | Four-digit year |
| %y | Two-digit year |
| %m | Two-digit month [01, 12] |
| %d | Two-digit day [01, 31] |
| %H | Hour (24-hour clock) [00, 23] |
| %I | Hour (12-hour clock) [01, 12] |
| %M | Two-digit minute [00, 59] |
| %S | Second [00, 61] (seconds 60, 61 account for leap seconds) |

In [10]:
strdt = dt.strftime("%Y-%m-%d %H:%M")
strdt

'2011-10-29 20:30'

In [11]:
dt

datetime.datetime(2011, 10, 29, 20, 30, 21)

Strings can be converted (parsed) into datetime objects with the `datetime.strptime` function, which takes the string and the format it is written in:

In [12]:
datetime.strptime("20091031", "%Y%m%d")

datetime.datetime(2009, 10, 31, 0, 0)
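Formatting and parsing use the same codes, so a string written with one format can be read back with it. The sketch below (variable names `stamp`, `fmt`, `text`, and `parsed` are illustrative) also shows that any field left out of the format is lost in the round trip and comes back as zero:

```python
from datetime import datetime

stamp = datetime(2011, 10, 29, 20, 30, 21)
fmt = "%Y-%m-%d %H:%M"  # no %S, so seconds are dropped

text = stamp.strftime(fmt)
parsed = datetime.strptime(text, fmt)

print(text)             # 2011-10-29 20:30
print(parsed)           # 2011-10-29 20:30:00
print(parsed == stamp)  # False, the 21 seconds were not stored

# Two-digit year and 12-hour clock codes from the table above
short_text = stamp.strftime("%d/%m/%y %I:%M")
print(short_text)       # 29/10/11 08:30
```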

When you are aggregating or otherwise grouping time series data, it will occasionally be useful to replace time fields of a series of datetimes—for example, replacing the minute and second fields with zero:

In [13]:
dt_hour = dt.replace(minute=0, second=0)  # returns a new datetime; dt itself is unchanged
dt_hour

datetime.datetime(2011, 10, 29, 20, 0)

Since datetime.datetime is an immutable type, methods like these always produce new objects. So in the previous example, dt is not modified by replace:

In [14]:
dt

datetime.datetime(2011, 10, 29, 20, 30, 21)
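As a sketch of the grouping use case mentioned above, the timestamps below are truncated to the hour with `replace` and then counted. The `events` list is made-up sample data, and `collections.Counter` is just one convenient way to tally the groups:

```python
from collections import Counter
from datetime import datetime

# Hypothetical event timestamps
events = [
    datetime(2011, 10, 29, 20, 30, 21),
    datetime(2011, 10, 29, 20, 45, 2),
    datetime(2011, 10, 29, 21, 5, 0),
]

# Zero out everything below the hour so events in the same hour compare equal
per_hour = Counter(
    ts.replace(minute=0, second=0, microsecond=0) for ts in events
)

for hour, count in sorted(per_hour.items()):
    print(hour, count)
# 2011-10-29 20:00:00 2
# 2011-10-29 21:00:00 1
```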

The difference of two datetime objects produces a datetime.timedelta type:

In [15]:
dt2 = datetime(2011, 11, 15, 22, 30)
delta = dt2 - dt
print(delta)
type(delta)

17 days, 1:59:39


datetime.timedelta

In [16]:
delta

datetime.timedelta(days=17, seconds=7179)

In [17]:
print(dt)
dt + delta

2011-10-29 20:30:21


datetime.datetime(2011, 11, 15, 22, 30)
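A timedelta stores its value as days, seconds, and microseconds, which is why the repr above shows `seconds=7179` rather than hours and minutes. A short sketch of reading those parts and of building a timedelta directly (the names `start`, `end`, and `gap` are illustrative):

```python
from datetime import datetime, timedelta

start = datetime(2011, 10, 29, 20, 30, 21)
end = datetime(2011, 11, 15, 22, 30)
gap = end - start

print(gap.days, gap.seconds)  # 17 7179
print(gap.total_seconds())    # 1475979.0

# A timedelta can also be constructed and added to a datetime
later = start + timedelta(hours=12)
print(later)                  # 2011-10-30 08:30:21
```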

### Bytes and Unicode (lightly lec)

Unicode is a standard that assigns a unique number, called a code point, to virtually every character in every writing system, including letters, digits, symbols, and emojis. Because each character has exactly one code point regardless of language or script, software can display and manipulate multilingual text consistently, which makes Unicode the foundation of internationalization in software and on the internet.

Code points are conventionally written as `U+` followed by the value in hexadecimal.

For example, in Unicode:

- The code point U+0041 represents the uppercase letter "A".
- The code point U+03B1 represents the Greek letter alpha (α).
- The code point U+1F600 represents the emoji "😀".
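In Python, the built-in functions `ord` and `chr` convert between a character and its code point, so the examples above can be checked directly. A minimal sketch (the variable names are illustrative):

```python
# ord() gives the code point of a single character as an integer
code_a = ord("A")
print(code_a, hex(code_a))   # 65 0x41

# chr() goes the other way, from code point to character
alpha = chr(0x03B1)
print(alpha)                 # α

# Format a code point in the conventional U+XXXX notation
emoji_label = f"U+{ord(chr(0x1F600)):04X}"
print(emoji_label)           # U+1F600
```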

Code points are abstract numbers. When text is stored or transmitted, an encoding scheme such as UTF-8 or UTF-16 specifies how each code point is translated into bytes.

UTF-8 (Unicode Transformation Format, 8-bit) is the most widely used of these schemes. It is a variable-length encoding: the number of bytes used for a character depends on its code point:

- U+0000 to U+007F (ASCII characters): one byte, identical to the ASCII encoding. This keeps UTF-8 backward compatible with the original ASCII character set.
- U+0080 to U+07FF: two bytes. This range covers accented Latin letters and many European and Middle Eastern scripts, such as Greek, Cyrillic, Hebrew, and Arabic.
- U+0800 to U+FFFF: three bytes. This range covers the rest of the Basic Multilingual Plane, including most South Asian scripts and the commonly used Chinese, Japanese, and Korean characters.
- U+10000 to U+10FFFF: four bytes. These supplementary characters include most emojis, historic scripts, and less commonly used symbols.
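The byte counts for each range can be verified by encoding one character from each. This is a small sketch; the four sample characters are arbitrary picks, written as escapes so the exact code points are unambiguous:

```python
# One character from each UTF-8 length class
samples = [
    "A",           # U+0041, ASCII
    "\u00f1",      # U+00F1, Latin small letter n with tilde
    "\u0928",      # U+0928, Devanagari letter na
    "\U0001F600",  # U+1F600, grinning face emoji
]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded!r}")

byte_counts = [len(ch.encode("utf-8")) for ch in samples]
print(byte_counts)  # [1, 2, 3, 4]
```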

UTF-8's variable-length design is space-efficient for text that is mostly ASCII, because those characters take a single byte, while still covering every Unicode character. It is the dominant encoding for web pages and the default text encoding in many programming languages and operating systems.

In summary, Unicode is a character encoding standard that provides a unique code point for each character in the world, and UTF-8 is one of the encoding schemes used to represent those code points in binary form, with variable-length encoding to efficiently cover a wide range of characters.

A "Unicode string" is a sequence of characters in which each character is represented by its Unicode code point. Such a string can therefore mix characters from different languages and scripts.

Key characteristics of a Unicode string include:

- Universal character support: Unicode strings can represent characters from any writing system, including Latin, Greek, Cyrillic, Chinese, Japanese, Arabic, and many others, as well as symbols, punctuation, control characters, and emojis.
- Encoding formats: A Unicode string can be encoded to bytes in several formats, most commonly UTF-8 and UTF-16. The format determines how the code points are represented as binary data for storage or transmission.
  - UTF-8: a variable-length encoding that uses 1 to 4 bytes per character. It is space-efficient for Latin-script text and can represent all Unicode characters.
  - UTF-16: also a variable-length encoding. It uses 2 bytes (one 16-bit unit) for characters up to U+FFFF and 4 bytes (a surrogate pair) for all others. It is the internal string format of systems such as Windows, Java, and JavaScript.
- Multilingual support: A single Unicode string can contain characters from multiple languages and scripts, which is essential for internationalization in software applications.
- Compatibility: The first 128 Unicode code points are identical to ASCII, so ASCII is a subset of Unicode, and ASCII and non-ASCII characters can be mixed freely in one string.
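Two of these points are easy to check in Python. The sketch below shows that UTF-16 needs two bytes for a character up to U+FFFF but four bytes for one beyond it, and that ASCII text yields the same bytes under ASCII and UTF-8. The variable names are illustrative:

```python
bmp_char = "A"              # U+0041, within U+0000..U+FFFF
astral_char = "\U0001F600"  # U+1F600, beyond U+FFFF

# "utf-16-le" omits the byte order mark, so only the character's bytes are counted
bmp_len = len(bmp_char.encode("utf-16-le"))
astral_len = len(astral_char.encode("utf-16-le"))
print(bmp_len, astral_len)  # 2 4

# ASCII-only text is byte-for-byte identical in ASCII and UTF-8
same_bytes = "Hello".encode("ascii") == "Hello".encode("utf-8")
print(same_bytes)           # True
```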

Here's an example of a Unicode string:

```
unicode_str = "Hello, 你好, こんにちは, مرحبًا, नमस्ते, 😊"
```
In this example, the string contains greetings in English, Chinese, Japanese, Arabic, and Hindi, followed by an emoji, all represented as Unicode code points. Because one string type handles every script, text can be represented and processed consistently across languages and platforms.
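One practical consequence is that the length of a Python string counts code points, not bytes. A short sketch with a shorter, hypothetical greeting (written with escapes so the code points are explicit):

```python
greeting = "Hello, \u4f60\u597d"  # "Hello, " followed by two Chinese characters

n_chars = len(greeting)                  # number of code points
n_bytes = len(greeting.encode("utf-8"))  # number of bytes once encoded

print(n_chars)  # 9
print(n_bytes)  # 13, since each Chinese character takes 3 bytes in UTF-8
```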


In modern Python (i.e., Python 3.0 and up), Unicode has become the first-class string type to enable more consistent handling of ASCII and non-ASCII text. In older versions of Python, strings were all bytes without any explicit Unicode encoding. You could convert to Unicode assuming you knew the character encoding. Here is an example Unicode string with non-ASCII characters:

In [None]:
val = "español"  # Unicode string
val

'español'

We can convert this Unicode string to its UTF-8 bytes representation using the `encode` method:

In [None]:
val_utf8 = val.encode("utf-8")
print(val_utf8)
type(val_utf8)

b'espa\xc3\xb1ol'


bytes

Assuming you know the Unicode encoding of a bytes object, you can go back using the decode method:

In [None]:
val_utf8.decode("utf-8")

'español'

While it is now preferable to use UTF-8 for any encoding, for historical reasons you may encounter data in any number of different encodings:

In [None]:
print(val.encode("latin1"))
print(val.encode("utf-16"))
val.encode("utf-16le")

b'espa\xf1ol'
b'\xff\xfee\x00s\x00p\x00a\x00\xf1\x00o\x00l\x00'


b'e\x00s\x00p\x00a\x00\xf1\x00o\x00l\x00'
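Decoding only works if the encoding you name is the one the bytes were written in. The sketch below shows the two typical failure modes when the guess is wrong: an exception, or silently garbled text. The variable names are illustrative, and `errors="replace"` is one of the standard error handlers accepted by `decode`:

```python
val = "espa\u00f1ol"  # "español"
latin1_bytes = val.encode("latin1")

# Latin-1 bytes are not valid UTF-8 here, so strict decoding raises
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError as exc:
    decode_failed = True
    print("decode failed:", exc.reason)

# errors="replace" substitutes U+FFFD for the undecodable byte instead of raising
lossy = latin1_bytes.decode("utf-8", errors="replace")
print(lossy)

# The reverse mistake raises nothing but produces garbled text
mojibake = val.encode("utf-8").decode("latin1")
print(mojibake)
```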

It is most common to encounter bytes objects in the context of working with files, where implicitly decoding all data to Unicode strings may not be desired.
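As a sketch of that file context, the snippet below writes a string to a temporary file and reads it back twice: once in binary mode (`"rb"`), which returns the raw bytes, and once in text mode with an explicit encoding, which decodes them. The file name `example.txt` and the use of `tempfile` are assumptions made only to keep the example self-contained:

```python
import os
import tempfile

val = "espa\u00f1ol"  # "español"

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.txt")  # hypothetical file name

    # Text mode: the string is encoded to UTF-8 on the way out
    with open(path, "w", encoding="utf-8") as f:
        f.write(val)

    # Binary mode: no decoding, we get a bytes object
    with open(path, "rb") as f:
        raw = f.read()

    # Text mode with the matching encoding: bytes are decoded back to str
    with open(path, encoding="utf-8") as f:
        text = f.read()

print(raw)   # b'espa\xc3\xb1ol'
print(text)  # español
```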