# Python for Data Science
## Session 4 
### Basic Libraries I

---

## Basic Libraries I

Let's jump into today's exercise.

### Exercise


Given a zip file containing a subfolder with multiple annotation files, where the naming convention for each of them is:

{DATE}_{TIME}_SN{SATELLITE_NUMBER}_QUICKVIEW_VISUAL_{VERSION}_{UNIQUE_REGION}.txt

where:

- DATE expressed as YYYYMMDD (year, month and day), e.g. 20241201, 20230321 ...
- TIME expressed as HHMMSS (hours, minutes and seconds), e.g. 213430
- SATELLITE_NUMBER an integer that represents the satellite number.
- VERSION provides the version of the pipeline, e.g. "0_1_2", "1_3_1" ...
- UNIQUE_REGION provides a unique location in the form of a string, e.g. SATL-2KM-10N_552_4164

Find out the following things about your data:

1. How many files the annotations folder has.
2. How many of them follow the naming convention expressed above.
3. How many annotations you have per month and year. Which month has the most annotation files.
4. Create a new annotations folder with one subfolder per month.
5. Print all the annotations from the most recent to the oldest one.
6. How many different satellites there are, how many annotations we have per satellite number, and which one was used in the most recent annotation file.
7. How many unique regions there are.

Some tips:
- The str class has a method called split; you can use it to get each field of an annotation filename.
- You can use sort from numpy on strings.
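
As a small illustration of the first tip, the sketch below splits one filename into its fields with `str.split`. The helper name `parse_annotation_name` and the dictionary keys are hypothetical, chosen only for this example. It assumes the filename already follows the convention and that VERSION always consists of three numbers, as in the examples above.

```python
import os


def parse_annotation_name(filename):
    """Split a filename that follows the convention into its fields.

    Hypothetical helper for illustration; it does not validate the name.
    """
    stem, _extension = os.path.splitext(filename)  # drop '.txt'
    # The first 8 underscores separate DATE, TIME, SN..., QUICKVIEW, VISUAL
    # and the three VERSION numbers. Everything after them is UNIQUE_REGION,
    # which may itself contain underscores, hence maxsplit=8.
    parts = stem.split('_', 8)
    return {
        'date': parts[0],
        'time': parts[1],
        'satellite_number': int(parts[2][2:]),  # strip the 'SN' prefix
        'version': '_'.join(parts[5:8]),
        'unique_region': parts[8],
    }


fields = parse_annotation_name(
    '20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt'
)
print(fields)
```

Limiting the number of splits keeps the region in one piece, which a plain `split('_')` would break into three parts.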

---

1. How many files the annotations folder has.

In [42]:
import os 

folder = 'session_4/annotations'

count = len(os.listdir(folder))
print(f'There are {count} files in the annotations folder')

There are 206 files in the annotations folder


2. How many of them follow the naming convention expressed above.

In [43]:
import re

# Pattern for {DATE}_{TIME}_SN{SATELLITE_NUMBER}_QUICKVIEW_VISUAL_{VERSION}_{UNIQUE_REGION}.txt
name_pattern = re.compile(r'^\d{8}_\d{6}_SN\d+_QUICKVIEW_VISUAL_\d+_\d+_\d+_[A-Za-z0-9\-_.]+\.txt$')

# Collect all files that follow the naming convention (reused in the next steps)
correct_naming = []

for x in os.listdir(folder):
    if name_pattern.match(x):
        correct_naming.append(x)

matches = len(correct_naming)

print(f'There are {matches} files following the name convention')

There are 194 files following the name convention
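
Note that the pattern only checks the shape of each field: `\d{8}` would also accept an impossible date such as 20241399. If stricter validation is wanted, one option is to parse DATE and TIME with `datetime.strptime`. The following is a minimal sketch, assuming the filename has already matched the pattern so that its first 15 characters are `YYYYMMDD_HHMMSS`; the helper name `has_valid_timestamp` is hypothetical.

```python
from datetime import datetime


def has_valid_timestamp(filename):
    """Return True if the leading DATE_TIME fields form a real date and time."""
    date_time = filename[:15]  # 'YYYYMMDD_HHMMSS' has a fixed width of 15
    try:
        datetime.strptime(date_time, '%Y%m%d_%H%M%S')
    except ValueError:
        # Raised for values such as month 13 or hour 25
        return False
    return True


valid = has_valid_timestamp(
    '20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt'
)
# Same name with an impossible date (month 13), used only as a counterexample
invalid = has_valid_timestamp(
    '20241399_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt'
)
print(valid, invalid)
```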


3. How many annotations you have per month and year. Which month has the most annotation files.

In [44]:
import calendar

count_month_year = {}
count_month = {}

#Iterate through new list with all files with the correct naming convention
for x in correct_naming:
    month_year = x[:6] #Naming convention is that first 6 characters = YYYYMM

    if month_year in count_month_year:
        count_month_year[month_year] += 1
    else:
        count_month_year[month_year] = 1

    month = x[4:6] #Naming convention is that 5th-6th characters = MM
    if month in count_month:
        count_month[month] += 1
    else:
        count_month[month] = 1

#Print # of annotations per month and year, in order of months
for y in sorted(count_month_year.keys()):
    year = int(y[:4])
    month = int(y[4:6])
    print(f'{calendar.month_name[month]} {year} has {count_month_year[y]} annotations') #print month name

print() #for output readability

#Print month with the most annotation files
most_annotations = max(count_month, key = count_month.get)
print(f'{calendar.month_name[int(most_annotations)]} has the most annotation files, with {count_month[most_annotations]} files')


January 2024 has 27 annotations
February 2024 has 45 annotations
March 2024 has 17 annotations
April 2024 has 25 annotations
May 2024 has 28 annotations
June 2024 has 52 annotations

June has the most annotation files, with 52 files
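
The manual dictionary updates in the cell above can also be written with `collections.Counter` from the standard library. The sketch below is an alternative under stated assumptions, not a replacement for that cell: `sample_names` is a short illustrative list so the block runs on its own (in the notebook you would iterate over `correct_naming` instead), and its last entry is a hypothetical filename.

```python
import calendar
from collections import Counter

sample_names = [
    '20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt',
    '20240619_185757_SN24_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_528_3700.txt',
    # Hypothetical filename, added only so that the sample spans two months
    '20240105_101500_SN24_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_528_3700.txt',
]

# Counter tallies each key, so no "if key in dict" bookkeeping is needed
count_month_year = Counter(name[:6] for name in sample_names)  # YYYYMM
count_month = Counter(name[4:6] for name in sample_names)      # MM

for key in sorted(count_month_year):
    year, month = int(key[:4]), int(key[4:6])
    print(f'{calendar.month_name[month]} {year} has {count_month_year[key]} annotations')

# most_common(1) returns a list holding the (month, count) pair with the highest count
top_month, top_count = count_month.most_common(1)[0]
print(f'{calendar.month_name[int(top_month)]} has the most annotation files, with {top_count} files')
```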


4. Create a new annotations folder with one subfolder per month.

In [45]:
import shutil as sh 

new_folder = 'session_4/monthly_folder'

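# exist_ok=True makes the cell safe to re-run if the folder already exists
os.makedirs(new_folder, exist_ok=True)

for x in correct_naming:
    month = x[4:6] #Naming convention is that 5th-6th characters = MM

    #One subfolder per month, e.g. session_4/monthly_folder/06
    month_folder = os.path.join(new_folder, month)
    os.makedirs(month_folder, exist_ok=True)

    #Copy (not move) the annotation so the original folder stays intact
    orig_path = os.path.join(folder, x)
    month_path = os.path.join(month_folder, x)
    sh.copy(orig_path, month_path)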

5. Print all the annotations from the most recent to the oldest one. 

In [46]:
annotations = [] #Avoid naming this "list", which would shadow the built-in

#Pair each filename with its DATE_TIME prefix
for x in correct_naming:
    field = x.split('_')
    date_time = field[0] + '_' + field[1]
    annotations.append((date_time, x))

#Sort the annotations from most recent to oldest
#(YYYYMMDD_HHMMSS strings sort chronologically when compared as text)
sorted_annotations = sorted(annotations, key = lambda y: y[0], reverse = True)

print('Here are the annotations from the most recent to the oldest file:')
for z in sorted_annotations:
    print(z[1]) #print entire filename

Here are the annotations from the most recent to the oldest file:
20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt
20240623_215102_SN43_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_384_3750.txt
20240623_193704_SN27_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_566_3734.txt
20240619_215556_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_742_4460.txt
20240619_185757_SN24_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_528_3700.txt
20240619_052401_SN30_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-52N_368_4336.txt
20240618_215539_SN31_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_452_3740.txt
20240618_215539_SN31_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_458_3756.txt
20240618_193146_SN27_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_530_3682.txt
20240617_211350_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_724_3614.txt
20240617_184443_SN24_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_702_3566.txt
20240617_052859_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-51N_730_4348.txt
20240616_213053_SN30_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_460_3792.txt
20240616_213047_SN30_QUI

6. How many different satellites there are, how many annotations we have per satellite number, and which one was used in the most recent annotation file. 

In [47]:
satellite_list = {} #Maps each satellite (e.g. 'SN29') to its number of annotations
sorted_files = [] #To store (date_time, satellite, filename) tuples, sorted below from most to least recent

for x in correct_naming:
    field = x.split('_')
    date_time = field[0] + '_' + field[1]
    satellite = field[2] 
    sorted_files.append((date_time, satellite, x)) 

    if satellite in satellite_list:
        satellite_list[satellite] += 1
    else:
        satellite_list[satellite] = 1

#Sort unique satellites
sorted_satellite = sorted(satellite_list.items())

#Sort annotation files from most to least recent
sorted_satellite_files = sorted(sorted_files, key = lambda y: y[0], reverse = True)

print(f'There are {len(satellite_list)} satellites\n')
for satellite, count in sorted_satellite: 
    print(f'Satellite {satellite}: {count} annotations')

#Print satellite in most recent annotation file
most_recent = sorted_satellite_files[0][1]

print(f'\nSatellite in most recent annotation file is {most_recent}')

There are 9 satellites

Satellite SN24: 26 annotations
Satellite SN26: 37 annotations
Satellite SN27: 29 annotations
Satellite SN28: 16 annotations
Satellite SN29: 22 annotations
Satellite SN30: 18 annotations
Satellite SN31: 19 annotations
Satellite SN33: 16 annotations
Satellite SN43: 11 annotations

Satellite in most recent annotation file is SN29
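
If only the most recent file is needed, a full sort is not required: `max` with a key on the `DATE_TIME` prefix finds it in one pass. This is a minimal sketch that assumes every name follows the convention; `sample_names` holds three filenames taken from the listing in step 5 so the block runs on its own (in the notebook you would pass `correct_naming`).

```python
sample_names = [
    '20240623_193704_SN27_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_566_3734.txt',
    '20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt',
    '20240623_215102_SN43_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_384_3750.txt',
]

# The first 15 characters are 'YYYYMMDD_HHMMSS'; being fixed-width and
# zero-padded, they compare chronologically as plain strings.
most_recent_file = max(sample_names, key=lambda name: name[:15])

# The satellite is the third underscore-separated field, e.g. 'SN29'
most_recent_satellite = most_recent_file.split('_')[2]
print(f'Satellite in most recent annotation file is {most_recent_satellite}')
```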


7. How many unique regions there are.

In [48]:
region = set() #no duplicates

for x in correct_naming:
    #Extract the region: everything from SATL up to (but not including) .txt
    region_match = re.search(r'SATL.*?(?=\.txt)', x)
    if region_match:
        region.add(region_match.group(0))

print(f'There are {len(region)} unique regions')

There are 137 unique regions
