Unexpected and Much Slower Performace on a Query Using 1.4 vs 1.3 #6302

jkelley-gfu · 2021-04-16T23:33:48Z

jkelley-gfu
Apr 16, 2021

Description of Problem

Some of our application's queries using sqlalchemy have gone from running in under a second and using very little RAM to making taking over a minute each and using gigabytes of ram. I expect we're doing something that violates sqlalchemy's expectation on how queries are built out and used, but I haven't able to find a clear reason for the large performance difference between 1.3 and 1.4.

I'm hoping for some clarity on if this performance difference is expected between 1.3 and 1.4 and if so, what changes I need to make in order to get things running again, since the increased time and memory usage is too much for our app or CI/CD pipeline to handle.

For background, our application is a custom built Human Resource Management System that has higher education specific features missing in most off the shelf software and most of our reports are built in a similar example to the code samples below.

Environment

I've ran the example below in a python3 docker container and my laptop running Archlinux using python 3.9. The database is a postgres db ran in a docker container.

Running the Example

Prepare a postgres db in a docker container:

docker run --rm --name timesheets_db_test -e POSTGRES_HOST_AUTH_METHOD=trust -p 5432:5432 postgesql

Create a virutalenv with 1.3 so we can confirm the original speed and resource use

python -m venv venv-1.3
source venv-1.3/bin/activate
pip install Flask Flask-SQLAlchemy sqlalchemy-utils psycopg2-binary arrow SQLAlchemy==1.3.*

Run example inside virtual environment

time python example_problem.py
Takes about 0.7s on my laptop and less than 40MB of ram

Create a virutalenv with 1.4 so we can reproduce my query problem

Deactivate 1.3 venv first if needed
python -m venv venv-1.4
source venv-1.4/bin/activate
pip install Flask Flask-SQLAlchemy sqlalchemy-utils psycopg2-binary arrow SQLAlchemy

Run example inside virtual environment

time python example_problem.py
Takes about 56s on my laptop and about 2.3GB of ram

Example Python Code

I've shortened this down quite a bit from production, my apologies if it is it is longer than needed. I'm afraid the setup is larger then the query in if __name__ == "__main__"

import flask
from flask import Flask
from flask_sqlalchemy import SQLAlchemy
from sqlalchemy.ext.mutable import MutableDict, MutableList
from sqlalchemy.dialects.postgresql import (JSONB, UUID, ARRAY)
from sqlalchemy_utils import(ArrowType)
from sqlalchemy.ext.hybrid import hybrid_property, hybrid_method
from sqlalchemy.ext.associationproxy import association_proxy

import arrow


from dataclasses import dataclass
from datetime import date, datetime, timedelta

@dataclass
class Config:

    TZ: str = 'US/Pacific'
    SQLALCHEMY_DATABASE_URI = "postgresql://postgres@localhost/postgres"

config = Config()


app = Flask(__name__)
app.config.from_object(config)



db = SQLAlchemy(app, engine_options={'json_serializer': flask.json.dumps})


#########
#
# Date Helper functions
#
#########
def to_local(column, tz=config.TZ):
    
    # UTC fix for timestamp without timezone
    result = db.func.timezone('UTC', column) 

    return db.func.timezone(tz, result)


def day_of_week(the_date, tz=config.TZ):                                         
   
    local_time = to_local(the_date, tz)                                     

    return db.func.extract('ISODOW', local_time).cast(db.Integer)



def is_weekend(the_date, tz=config.TZ):                                              

    SATURDAY = 6                                                                
    SUNDAY = 7                                                                  

    day = day_of_week(the_date, tz)                                                   

    return db.or_(day == SATURDAY, day == SUNDAY)


def is_weekday(the_date, tz=config.TZ):                                              

    return db.not_(is_weekend(the_date, tz))




def day_of_week_from_date(the_date):                                         
    
    return db.func.extract('ISODOW', the_date).cast(db.Integer)



def start_of_week_from_date(the_date, start_day): 
    day_interval = db.text("Interval '1 days'")

    start = the_date
    dow = day_of_week_from_date(the_date)                                                   
    
    before = dow < start_day
    before_math = (start_day - dow).cast(db.Integer)


    expr = (
            db.case(
                [
                  (
                    before, start + before_math.op("*")(day_interval)
                  )
                ],
                else_= (start + (start_day + 7 - dow).cast(db.Integer).op("*")(day_interval))
            ) - db.text("Interval '7 days'")
           )

    return db.func.date_trunc('day', expr)


def start_of_week_from_date_str(the_date, start_day, tz=config.TZ):
    return db.func.to_char(start_of_week_from_date(the_date, start_day), "YYYY-MM-DD")



class WorkSchedule(db.Model):

    __tablename__ = 'work_schedule'
    __table_args__ = (db.UniqueConstraint('emplid', 'empl_rcd', name='_one_schedule_per_job'),)
    
    id = db.Column(db.Integer, primary_key=True)
    
    emplid = db.Column(db.String, nullable=False)
    empl_rcd = db.Column(db.Integer, nullable=False)

    weekly_hours = db.Column(db.Numeric)



########
#
# We use materialized views on postgres to integrate with data from our ERP
#
########

class Department(db.Model):

    __tablename__ = 'department_mv'

    id = db.Column(db.Text, primary_key=True)
    deptid = db.Column(db.Text)
    name = db.Column(db.Text)
 
 

    def __repr__(self):
        return f'<Department {self.id}/{self.name}>'




class Job(db.Model):

    __tablename__ = 'job_mv'
    __table_args__ = (db.PrimaryKeyConstraint('emplid', 'empl_rcd'),
                     )


    emplid = db.Column(db.String )
    empl_rcd = db.Column(db.Integer)


    exempt = db.Column(db.Boolean)

    department_id = db.Column(db.String, db.ForeignKey("department_mv.id"))
    title = db.Column(db.String)
    
    jobcode = db.Column(db.String)
    position_number = db.Column(db.String)
    
    
    pay_group = db.Column('paygroup', db.String)
    

    standard_hours = db.Column('Standard Hours', db.Float)
    standard_hours_frequency = db.Column('Standard Hours Frequency', db.String)
    
    

    department = db.relationship('Department', viewonly=True, uselist=False)
 
    schedule = db.relationship('WorkSchedule', 
                                  backref=db.backref('job', uselist=False),
                                  viewonly=True, 
                                  uselist=False,
                                  foreign_keys=[WorkSchedule.emplid, WorkSchedule.empl_rcd],
                                  primaryjoin="and_(Job.emplid == WorkSchedule.emplid,"
                                               "Job.empl_rcd == WorkSchedule.empl_rcd,)",
                                  )

    
 

    @hybrid_property
    def id(self):
        id_ = f"{self.emplid}_{self.empl_rcd}"

        return id_

    @id.expression
    def id(cls):
        return db.func.concat(cls.emplid, '_', cls.empl_rcd)



    def __repr__(self):
        return f'<Job {self.emplid}/{self.empl_rcd}>'



#########
#
# Back to regular tables
#
#########


class PayPeriod(db.Model):

    __tablename__ = 'pay_period'

    id = db.Column(db.Integer, primary_key = True)
    
    description = db.Column(db.Text, nullable = False)

    start = db.Column(db.Date, nullable = False)
    end = db.Column(db.Date, nullable = False)
    
    open_ = db.Column('open', db.Boolean, nullable = False, default=False)
    
    
    # ISO day of week 1=Monday 7=Sunday, Needed for overtime
    start_of_week = db.Column(db.Integer, nullable = False)
    
    pay_groups = db.Column(MutableList.as_mutable(ARRAY(db.String)))
    
    

    @hybrid_property
    def is_current(self):
        today = arrow.utcnow().to(config.TZ).date()
        return today >= self.start and today <= self.end

    @is_current.expression
    def is_current(cls):
        return db.and_(db_date_now >= cls.start, db_date_now <= cls.end)

    @hybrid_property
    def is_past(self):
        today = arrow.utcnow().date()
        return today > self.end

    @is_past.expression
    def is_past(cls):
        return db_date_now > cls.end


    @hybrid_method
    def start_day_of_week(self, tz=config.TZ):
        return self.start.isoweekday()

    @start_day_of_week.expression
    def start_day_of_week(cls, tz=config.TZ):
        local_time = db.func.timezone(tz, cls.start)

        return db.func.extract('ISODOW', local_time).cast(db.Integer)
    

    @hybrid_method
    def end_day_of_week(self, tz=config.TZ):
        return self.end.isoweekday()

    @end_day_of_week.expression
    def end_day_of_week(cls, tz=config.TZ):
        local_time = db.func.timezone(tz, cls.end)

        return db.func.extract('ISODOW', local_time).cast(db.Integer)
    

 
    @hybrid_method
    def first_week(self, tz=config.TZ):
        
        week_start = None
        
        start = arrow.get(self.start, tz)
        dow = self.start_day_of_week(tz)
        start_day = self.start_of_week

        if dow < start_day:
            shift_days = start_day - dow
            week_start = start.shift(days=shift_days)
        else:
            shift_days = (start_day + 7) - dow
            week_start = start.shift(days=shift_days)
        
        week_start = week_start.shift(days=-7)
        
        return arrow.get(week_start.date(), tz) 

   


    @first_week.expression
    def first_week(cls, tz=config.TZ):
            
        day_interval = db.text("Interval '1 days'")

        start = to_local(cls.start, tz)
        dow = cls.start_day_of_week(tz)
        start_day = cls.start_of_week
        
        before = dow < start_day
        before_math = (start_day - dow).cast(db.Integer)


        expr = (
                db.case(
                    [
                      (
                        before, start + before_math.op("*")(day_interval)
                      )
                    ],
                    else_= (start + (start_day + 7 - dow).cast(db.Integer).op("*")(day_interval))
                ) - db.text("Interval '7 days'")
               )

        return db.func.date_trunc('day',
                                  db.func.timezone(tz, db.func.date_trunc('day', expr))
                                 )


    
    @hybrid_method
    def first_week_str(self, tz=config.TZ):
        return self.first_week(tz).format("YYYY-MM-DD")

    @first_week_str.expression
    def first_week_str(cls, tz=config.TZ):
        return db.func.to_char(cls.first_week(tz), "YYYY-MM-DD")


    @hybrid_method
    def last_week(self, tz=config.TZ):
        
        return start_of_week_from_date_object(self.end, self.start_of_week, tz=tz)

 
    @last_week.expression
    def last_week(cls, tz=config.TZ):
            
        return start_of_week_from_date(cls.end, cls.start_of_week)
  

    @hybrid_method
    def last_week_is_full(self, tz=config.TZ):
        
        days_in_week = arrow.get(self.end, tz) - self.last_week(tz)
        return not (days_in_week < timedelta(6))

 
    @last_week_is_full.expression
    def last_week_is_full(cls, tz=config.TZ):
 
        return ~((cls.end - cls.last_week(tz)) < timedelta(days=6)) 
  

class Timesheet(db.Model):
    
    __tablename__ = 'timesheet'
    __table_args__ = (db.UniqueConstraint('pay_period_id', 'emplid', 'empl_rcd', 'employment_id', name='_one_timesheet_per_pay_period'),
                     )


    id = db.Column(db.Integer, primary_key=True)
    
    pay_period_id = db.Column(db.Integer, db.ForeignKey(PayPeriod.id), nullable = False) 
    
    type_ = db.Column('type', db.Text, nullable=False)
    emplid = db.Column(db.Text, nullable=False) 
    empl_rcd = db.Column(db.Integer, nullable=True)
    employment_id = db.Column(db.Integer, nullable=True)
    department_id = db.Column(db.String, nullable=False)

    created_at = db.Column(ArrowType, default=arrow.utcnow)
    
    payroll_info = db.Column(MutableDict.as_mutable(JSONB), nullable = False, default = {})
    
    
    
    pay_period = db.relationship('PayPeriod', backref = "timesheets", uselist=False)
    pay_period_start = association_proxy('pay_period', 'start')
    pay_period_end = association_proxy('pay_period', 'end')
 
    
    job = db.relationship('Job',
                          backref="timesheets",
                          foreign_keys=[emplid, empl_rcd],
                          primaryjoin="and_(Job.emplid == Timesheet.emplid,"
                                       "Job.empl_rcd == Timesheet.empl_rcd)",
                          uselist=False)


    days = db.relationship('Day', backref="timesheet")

    
    department = db.relationship('Department', 
                                 foreign_keys=[department_id], 
                                 primaryjoin = 'Department.id == Timesheet.department_id')
 
         
   


class Day(db.Model):
    
    __tablename__ = 'day'
    
    id = db.Column(db.Integer, primary_key=True)
    timesheet_id = db.Column(db.Integer, db.ForeignKey(Timesheet.id), nullable=False) 
    
    date = db.Column(db.Date, nullable = False)
    
    hours = db.relationship('Hour', backref="day")


    @hybrid_property
    def total_hours(self):
          
        all_hours = sum([i.hours for i in self.hours])

        return all_hours 

    @total_hours.expression
    def total_hours(cls):
        return db.select([db.func.sum(Hour.hours)]).where(Hour.day_id==cls.id).label('total_hours')


    @hybrid_method
    def day_of_week(self, tz=config.TZ):
        """
        tz is not used since this is naive date, but is kept to keep the api
        the same with the other methods of the same name
        """
        return self.date.isoweekday()

    @day_of_week.expression
    def day_of_week(cls, tz=config.TZ):
        """
        tz is not used since this is naive date, but is kept to keep the api
        the same with the other methods of the same name
        """
        
        return db.func.extract('ISODOW', cls.date).cast(db.Integer)
  
    @hybrid_method
    def is_weekend(self, tz=config.TZ):
        SATURDAY = 6
        SUNDAY = 7
        
        day = self.day_of_week(tz)

        return day in [SATURDAY, SUNDAY]


    @is_weekend.expression
    def is_weekend(cls, tz=config.TZ):
        
        SATURDAY = 6
        SUNDAY = 7
        
        day = cls.day_of_week(tz)

        return db.or_(day == SATURDAY, day == SUNDAY)
   
    @hybrid_method
    def is_weekday(self, tz=config.TZ):

        return not self.is_weekend(tz)


    @is_weekday.expression
    def is_weekday(cls, tz=config.TZ):
        
        return db.not_(cls.is_weekend(tz))
    
     
    @hybrid_method
    def start_of_week(self, start_day, tz=config.TZ):
        week_start = None
        
        start = arrow.get(self.date, tz)
        dow = self.day_of_week(tz)

        if dow < start_day:
            shift_days = start_day - dow
            week_start = start.shift(days=shift_days)
        else:
            shift_days = (start_day + 7) - dow
            week_start = start.shift(days=shift_days)
        
        week_start = week_start.shift(days=-7)
        
        return arrow.get(week_start.date(), tz) 

    @start_of_week.expression
    def start_of_week(cls, start_day, tz=config.TZ):
        """
        tz is not used since this is naive date, but is kept to keep the api
        the same with the other methods of the same name
        """

        return start_of_week_from_date(cls.date, start_day)

    @hybrid_method
    def start_of_week_str(self, start_day, tz=config.TZ):
        return self.start_of_week(start_day, tz).format("YYYY-MM-DD")

    @start_of_week_str.expression
    def start_of_week_str(cls, start_day, tz=config.TZ):
        return db.func.to_char(cls.start_of_week(start_day, tz), "YYYY-MM-DD")


class HourType(db.Model):
    
    __tablename__ = 'hour_type'
    
    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.Text, nullable = False)
    leave = db.Column(db.Boolean)

    #new
    leave_unpaid = db.Column(db.Boolean, nullable=False, default=False)

    day_ok = db.Column(db.Boolean, nullable=False, default=True)
    shift_ok = db.Column(db.Boolean, nullable=False, default=True)

    #new
    exempt_ok = db.Column(db.Boolean, nullable=False, default=True)
    non_exempt_ok = db.Column(db.Boolean, nullable=False, default=True)
    weight = db.Column(db.Float, nullable=False, default=10.0) # Used for sorting
    disabled = db.Column(db.Boolean, nullable=False, default=False)

    external_code = db.Column(db.Text, nullable=True)
    
    
    is_work = db.Column(db.Boolean, nullable=False, default=True)
    adds_additional_time = db.Column(db.Boolean, nullable=False, default=True)



class Hour(db.Model):
    
    __tablename__ = 'hour'
    
    id = db.Column(db.Integer, primary_key=True)
    day_id = db.Column(db.Integer, db.ForeignKey(Day.id), nullable = False) 
    hour_type_id = db.Column(db.Integer, db.ForeignKey(HourType.id), nullable = False) 
    
    hours = db.Column(db.Float, nullable = False)
    note = db.Column(db.Text, nullable = True)
    
    # Do I need to make a type model?
    type_ = db.relationship('HourType')




def setup_db():
    
    ###########
    #
    # Setup the materialized views with testing data
    #
    ###########
    
    db.session.execute("""
        drop materialized view if exists public.job_mv cascade;
        create materialized view public.job_mv as
        (
        SELECT
            *
          FROM
          (
            SELECT
              'Executive' as flsa_status,
              true as exempt,
              true as non_faculty_exempt,
              '0000101' as emplid,
              0 as empl_rcd,
              'TESTS_001' as department_id,
              'President' as title,
              'TESTS_001' as job_deptid,
              'President''s Office' as "Job Department",
              'JOB_114' as jobcode,
              'President' as "Jobcode Title",
              '10000001' as position_number,
              'President' as position,
              'President''s Office' as "Position Department",
              'TESTS_001' as position_deptid,
              'Active' as "Payroll Status",
              'SOMEPLACE' as location,
              'Full-Time' as "Full/Part Time",
              'EMP' as paygroup,
              'R' as reg_temp,
              'Active' as "Benefits Employee Status",
              'Salaried' as "Employee Type",
              1.000000 as fte,
              ' ' as reports_to,
              ' ' as supervisor_id,
              '2080.00' as "Standard Hours",
              'Annually' as  "Standard Hours Frequency"
            Union select 'Executive', true, true, '0000100' , 0, 'TESTS_002', 
                         'Vice President for Awsome', 'TESTS_002', 'Bureau of Awesome', 
                          'JOB_000', 'Executive Vice President', '10000002', 
                          'Vice President for Awesome', 'Bureau of Awesome', 'TESTS_002' , 
                          'Active', 'SOMEPLACE', 'Full-Time', 'EMP', 'R', 'Active', 
                          'Salaried', 1.000000, '10000001',' ',  
                          2080.00, 'Annually'
        ) a
        order by emplid
        ); """)
    
    
    db.session.execute(""" drop materialized view if exists public.department_mv;
        create materialized view public.department_mv as
        (
          SELECT
            *
          FROM
          (
            SELECT 'TESTS_001' AS id, 'TESTS_001' AS deptid, 'President''s Office' as name
            UNION SELECT 'TESTS_002','TESTS_002', 'Bureau of Awesome'
          ) a
          order by deptid)
        """)

    db.session.commit()
    
    # Create the tables for the Acutal Tables
    db.create_all()




if __name__ == "__main__":
    print("Example")
    
    day_week_starts = 6 # Saturday - ISO day of week
    TZ = config.TZ
    

    setup_db()
    
    # An pay period is required for our report
    pay_period = PayPeriod( description = "Current", 
                            start=arrow.utcnow().shift(days=-15).to(TZ).date(), 
                            end=arrow.utcnow().shift(days=+15).to(TZ).date(), 
                            open_=True,
                            pay_groups=['EMP'],
                            start_of_week=day_week_starts
                          )
    
    db.session.add(pay_period)
    db.session.commit()
    
    # Job record for the timesheet we are setting up
    job = Job.query.get(['0000100', 0])

    # Create the timesheet
    with db.session.no_autoflush:
        timesheet = Timesheet()
        timesheet.pay_period = pay_period
        timesheet.type_ = 'Day'
        timesheet.job = job
        timesheet.department = job.department

    db.session.add(timesheet)
    db.session.commit()
    


    ########
    #
    # The actual query that has radically different performace characteristics.
    #
    # Note: We use the query -> subquery for query reuse a lot in our app
    #
    ########
    
    # Job Standard hours (source depends on if a seperate schedule is setup
    has_schedule = db.session.query(WorkSchedule.emplid, WorkSchedule.empl_rcd) 
    job_standard_weekly_hours = db.session.query(Job.id.label('id'),                            
                                         Job.emplid.label('emplid'),                        
                                         Job.empl_rcd.label('empl_rcd'),                      
                                         Job.pay_group.label('pay_group'),                     
                                         Job.exempt.label('exempt'),                        
                                         Job.standard_hours.label('standard_hours'), 
                                         Job.standard_hours_frequency.label('standard_hours_frequency'))\
                                .filter(~db.tuple_(Job.emplid, Job.empl_rcd).in_(has_schedule.subquery()))

    
    schedule_standard_weekly_hours = db.session.query( Job.id.label('id'),
                                          Job.emplid.label('emplid'),                        
                                          Job.empl_rcd.label('empl_rcd'),                      
                                          Job.pay_group.label('pay_group'),                     
                                          Job.exempt.label('exempt'),
                                          WorkSchedule.weekly_hours.label('standard_hours'),
                                          db.literal('Weekly').label('standard_hours_frequency')
                                        ).join(WorkSchedule, db.and_(Job.emplid == WorkSchedule.emplid, 
                                                                     Job.empl_rcd == WorkSchedule.empl_rcd))
    
    # Combine the two possible standard hour sources
    standard_weekly_hours = job_standard_weekly_hours.union(schedule_standard_weekly_hours)
    standard_weekly_hours = standard_weekly_hours.filter(Job.standard_hours_frequency == 'Weekly')
    standard_weekly_hours = standard_weekly_hours.filter(Job.exempt == False)
    
    # Speed of query seems okay in both versions
    print("Standard weekly hours", standard_weekly_hours.count())


    #
    # Hours report query
    #
    hours = db.session.query(Hour.id.label('hour_id'),
                             Day.id.label('day_id'),
                             Timesheet.id.label('timesheet_id'),
                             Timesheet.emplid.label('emplid'),
                             Timesheet.empl_rcd.label('empl_rcd'),
                             PayPeriod.id.label('pay_period_id'),
                             Day.date.label('day_date'),
                             Day.is_weekend().label('is_weekend'),
                             Day.is_weekday().label('is_weekday'),
                             HourType.name.label('type'),
                             HourType.id.label('type_id'),
                             HourType.is_work.label('is_work'),
                             HourType.adds_additional_time.label('adds_additional_time'),
                             HourType.leave.label('leave'),
                             HourType.leave_unpaid.label('leave_unpaid'),
                             Hour.hours.label('hours'),
                             Day.start_of_week_str(day_week_starts, tz=TZ).label('start_of_week_str'), 
                             Day.start_of_week(day_week_starts, tz=TZ).label('start_of_week'),
                             db.and_(PayPeriod.last_week() == Day.start_of_week(day_week_starts, tz=TZ), 
                                     ~PayPeriod.last_week_is_full()).label('partial_last_week'),

                            )
    
    hours = hours.join(Hour, Hour.day_id == Day.id)\
                 .join(HourType, Hour.hour_type_id == HourType.id)\
                 .join(Timesheet, Timesheet.id == Day.timesheet_id)\
                 .join(PayPeriod, PayPeriod.id == Timesheet.pay_period_id)


    # Add padding so there is a day for every day of the pay period
    pay_period_days = db.session.query(
                         db.func.generate_series(pay_period.start, 
                                                 pay_period.end, 
                                                 timedelta(days=1)
                                                ).cast(db.Date).label('date')).subquery()


    padding_days = db.session.query(
                 db.literal(-1).label('hour_id'),
                 db.literal(-1).label('day_id'),
                 pay_period_days.c.date.label('day_date'),
                 is_weekend(pay_period_days.c.date).label('is_weekend'),
                 is_weekday(pay_period_days.c.date).label('is_weekday'),
                 db.literal('Query Padding').label('type'),
                 db.literal(-1).label('type_id'),
                 db.literal(False).label('is_work'),
                 db.literal(False).label('adds_additional_time'),
                 db.literal(True).label('leave'),
                 db.literal(False).label('leave_unpaid'),
                 db.literal(0).label('hours'),
                 start_of_week_from_date(pay_period_days.c.date, day_week_starts).label('start_of_week'),
                 start_of_week_from_date_str(pay_period_days.c.date, day_week_starts).label('start_of_week_str'),
                )

    
    timesheet_info = db.session.query(Timesheet.id.label('timesheet_id'),
                          Timesheet.emplid.label('emplid'),
                          Timesheet.empl_rcd.label('empl_rcd'),
                          Timesheet.pay_period_id.label('pay_period_id'),
                          PayPeriod.last_week().label('pay_period_last_week'),
                          PayPeriod.last_week_is_full().label('pay_period_last_week_is_full'),
                         ).join(Timesheet.pay_period).filter(Timesheet.pay_period_id == pay_period.id)

    d = padding_days.subquery()
    t = timesheet_info.subquery()
    
    # Cartesian product on purpose wo we can make sure there is a day in our
    # query for each day in the pay period.
    timesheet_padding = db.session.query(d.c.hour_id, 
                                         d.c.day_id,
                                         t.c.timesheet_id,
                                         t.c.emplid,
                                         t.c.empl_rcd,
                                         t.c.pay_period_id,
                                         d.c.day_date,
                                         d.c.is_weekend,
                                         d.c.is_weekday,
                                         d.c.type,
                                         d.c.type_id,
                                         d.c.is_work,
                                         d.c.adds_additional_time,
                                         d.c.leave,
                                         d.c.leave_unpaid,
                                         d.c.hours,
                                         d.c.start_of_week_str,
                                         d.c.start_of_week,
                                         db.and_(t.c.pay_period_last_week == d.c.start_of_week, 
                                                 ~t.c.pay_period_last_week_is_full).label('partial_last_week'),
                                        ).join(t, db.true())

    hours = hours.union(timesheet_padding)
    
    hours = hours.order_by(Timesheet.emplid, 
                       Timesheet.id, 
                       Day.start_of_week_str(day_week_starts).desc(), 
                       Day.date.desc(), 
                       HourType.leave)


    # Speed of query seems okay in both versions
    print("Hours", hours.count())

    h =  hours.subquery()
    sh = standard_weekly_hours.subquery()
    
    hours_prep = db.session.query(h.c.emplid,
                                  h.c.pay_period_id,
                                  h.c.timesheet_id,
                                  h.c.timesheet_id,
                                  h.c.empl_rcd,
                                  h.c.start_of_week_str,
                                  h.c.start_of_week,
                                  db.case(
                                    [((pay_period.end - h.c.start_of_week) < timedelta(days=6), False)],  # 6 days - 31-25 is 6 even though it is 7 days 
                                      else_=True).label('whole_week'), # If the last week of the time period was a full one
                                  h.c.is_work,
                                  h.c.adds_additional_time,
                                  h.c.leave,
                                  h.c.leave_unpaid,
                                  h.c.hour_id,
                                  h.c.hours,
                                  h.c.day_date,
                                  sh.c.standard_hours
                                 )
        
    hours_prep = hours_prep.outerjoin(sh, db.and_(sh.c.emplid == h.c.emplid,
                                                  sh.c.empl_rcd == h.c.empl_rcd)
                                     )

    
    print("Hours Prep", hours_prep.count())

    
    h = hours_prep.subquery()

    weekly = db.session.query(h.c.emplid,
                              db.func.string_agg(h.c.pay_period_id.cast(db.Text).distinct(), '; ').label('pay_period_ids'),
                              db.func.array_agg(h.c.pay_period_id.distinct()).label('pay_period_ids_list'),
                              db.func.string_agg(h.c.timesheet_id.cast(db.Text).distinct(), '; ').label('timesheet_ids'),
                              db.func.array_agg(h.c.timesheet_id.distinct()).label('timesheet_ids_list'),
                              h.c.empl_rcd,
                              h.c.start_of_week_str,
                              h.c.start_of_week,
                              h.c.whole_week,
                              h.c.standard_hours,
                              db.func.count(h.c.hour_id).label('hours_count'),
                              db.func.sum(h.c.hours).label('total_hours'),
                             (db.func.sum(db.case([(h.c.adds_additional_time == True, h.c.hours)], else_=0)) - h.c.standard_hours).label("additional"),
                             (db.func.sum(db.case([(db.and_(h.c.adds_additional_time == True, h.c.is_work == False), h.c.hours)], else_=0))).label("additional_hours_not_work"),
                             (h.c.standard_hours - db.func.sum(h.c.hours) ).label("short"),
                             (db.func.sum(db.case([(h.c.leave == True and h.c.leave_unpaid == False, h.c.hours)], else_=0))).label('paid_leave_hours'),
                             (db.func.sum(db.case([(h.c.leave_unpaid == True, h.c.hours)], else_=0))).label('unpaid_leave_hours'),
                             (db.func.sum(db.case([(h.c.is_work == True, h.c.hours)], else_=0))).label('work_hours'),
                             )
 
    weekly = weekly.filter( h.c.day_date >= pay_period.first_week(tz=TZ).date(), # not the first day of the pay period, but of first day of the first week of the pay period. 
                             h.c.day_date <= pay_period.end, 
                          )

        
    weekly = weekly.group_by(h.c.emplid,  
                             h.c.empl_rcd,
                             h.c.start_of_week_str, 
                             h.c.start_of_week, 
                             h.c.whole_week,
                             h.c.standard_hours,
                            )

    # Only include rows with short or additional time, 
    weekly = weekly.having( db.and_(
                                db.or_(
                                    (db.func.sum(h.c.hours) - h.c.standard_hours) > 0,
                                    (h.c.standard_hours - db.func.sum(h.c.hours) ) > 0
                                    )
                            )
                          )
    

    # Speed of query seems slightly slower in 1.4
    # About 2s and 120MB of ram in version 1.4 if I end at this point compared to
    # about 0.7s and about 20MB of ram
    print("Weekly", weekly.count())
    
    ####
    #
    # This seems to be what is causing the slow down and memory usage
    #
    ####

    # Clean up the output
    w = weekly.subquery()
    cleanup = db.session.query(
                                w.c.emplid,
                                w.c.empl_rcd,
                                w.c.pay_period_ids,
                                w.c.pay_period_ids_list,
                                w.c.timesheet_ids,
                                w.c.timesheet_ids_list,
                                #w.c.timesheet_jobs,
                                w.c.start_of_week_str,
                                w.c.start_of_week,
                                w.c.whole_week,
                                w.c.standard_hours,
                                w.c.hours_count,
                                w.c.total_hours,
                                w.c.work_hours,
                                db.case([(w.c.additional > 0, w.c.additional)], else_=0).label('additional'),
                                db.case([(w.c.additional_hours_not_work > 0, w.c.additional_hours_not_work)], else_=0).label('additional_hours_not_work'),
                                db.case([(w.c.additional > 0, w.c.paid_leave_hours)], else_=0).label('paid_leave_hours'),
                                db.case([(w.c.additional > 0, w.c.unpaid_leave_hours)], else_=0).label('unpaid_leave_hours'),
                                #db.case([((w.c.additional - w.c.unpaid_leave_hours) > w.c.standard_hours, (w.c.additional - w.c.unpaid_leave_hours) )], else_=0).label('additional_without_unpaid_leave_hours'),
                                db.case([ ( db.and_(w.c.unpaid_leave_hours == 0, w.c.additional > 0), w.c.additional ),
                                          ( (w.c.additional - w.c.unpaid_leave_hours) >= 0, (w.c.additional - w.c.unpaid_leave_hours) )
                                        ], else_=0).label('additional_without_unpaid_leave_hours'),
                                db.case([(w.c.short > 0, w.c.short)], else_=0).label('short'),
                              )

    # Only include whole weeks so we don't show people as having short time
    # just because it is the end of the month.
    cleanup = cleanup.filter(w.c.whole_week == True)

    print("Cleanup", cleanup.count())

Thank you for your help and I look forward to migrating to v1.4.

Answered by zzzeek

Apr 17, 2021

hi there -

this is a very long example that also uses external dependencies, are you able to provide any profiling data using the linked recipe? This will provide any details on whether there is some Python codepath that is much more intensive.

Based on the memory results it seems likely however that something is either leaking memory or using way too much of it, and the time spent could be just the Python interpreter growing to that 2G size. For the memory situation, can you try disabling caching . The main thing SQLAlchemy 1.4 does differently with memory is that it's storing query structures in a cache.

I can try running later but if you can just experiment with those two things that m…

View full answer

zzzeek · 2021-04-17T00:12:00Z

zzzeek
Apr 17, 2021
Maintainer

hi there -

this is a very long example that also uses external dependencies, are you able to provide any profiling data using the linked recipe? This will provide any details on whether there is some Python codepath that is much more intensive.

Based on the memory results it seems likely however that something is either leaking memory or using way too much of it, and the time spent could be just the Python interpreter growing to that 2G size. For the memory situation, can you try disabling caching . The main thing SQLAlchemy 1.4 does differently with memory is that it's storing query structures in a cache.

I can try running later but if you can just experiment with those two things that may provide some clues.

10 replies

zzzeek Apr 17, 2021
Maintainer

#6304 is the main issue.

zzzeek Apr 17, 2021
Maintainer

the patch to repair which I'm running through testing now is:

diff --git a/lib/sqlalchemy/sql/selectable.py b/lib/sqlalchemy/sql/selectable.py
index 2a221cdf0b..0a828c7528 100644
--- a/lib/sqlalchemy/sql/selectable.py
+++ b/lib/sqlalchemy/sql/selectable.py
@@ -4424,19 +4424,29 @@ class SelectState(util.MemoizedSlots, CompileState):
 
 
 class _SelectFromElements(object):
+
     def _iterate_from_elements(self):
         # note this does not include elements
         # in _setup_joins or _legacy_setup_joins
 
-        return itertools.chain(
-            itertools.chain.from_iterable(
-                [element._from_objects for element in self._raw_columns]
-            ),
-            itertools.chain.from_iterable(
-                [element._from_objects for element in self._where_criteria]
-            ),
-            self._from_obj,
-        )
+        seen = set()
+        for element in self._raw_columns:
+            for fr in element._from_objects:
+                if fr in seen:
+                    continue
+                seen.add(fr)
+                yield fr
+        for element in self._where_criteria:
+            for fr in element._from_objects:
+                if fr in seen:
+                    continue
+                seen.add(fr)
+                yield fr
+        for element in self._from_obj:
+            if element in seen:
+                continue
+            seen.add(element)
+            yield element
 
 
 class Select(

im going to release this ASAP however feel free to test either the patch or github master when it merges.

jkelley-gfu Apr 17, 2021
Author

Thank you so much! It's nice to have a forthcoming solution and to have been able to learn more in the process!

I will look at the difference in the generated SQL, but it may take me a little bit of time to clear my schedule to do a line by line comparison. Aside from correctness of results is there anything in particular you think I should keep an eye out for when I do that SQL review?

zzzeek Apr 17, 2021
Maintainer

not really, it's a huge query so I noticed it seemed to be differnet, id make sure it's still correct

zzzeek Apr 17, 2021
Maintainer

1.4.9 is released with that fix. im out for tonight!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unexpected and Much Slower Performace on a Query Using 1.4 vs 1.3 #6302

{{title}}

Replies: 1 comment 10 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Unexpected and Much Slower Performace on a Query Using 1.4 vs 1.3 #6302

jkelley-gfu Apr 16, 2021

Description of Problem

Environment

Running the Example

Prepare a postgres db in a docker container:

Create a virutalenv with 1.3 so we can confirm the original speed and resource use

Run example inside virtual environment

Create a virutalenv with 1.4 so we can reproduce my query problem

Run example inside virtual environment

Example Python Code

Replies: 1 comment · 10 replies

zzzeek Apr 17, 2021 Maintainer

zzzeek Apr 17, 2021 Maintainer

zzzeek Apr 17, 2021 Maintainer

jkelley-gfu Apr 17, 2021 Author

zzzeek Apr 17, 2021 Maintainer

zzzeek Apr 17, 2021 Maintainer

jkelley-gfu
Apr 16, 2021

Replies: 1 comment 10 replies

zzzeek
Apr 17, 2021
Maintainer

zzzeek Apr 17, 2021
Maintainer

zzzeek Apr 17, 2021
Maintainer

jkelley-gfu Apr 17, 2021
Author

zzzeek Apr 17, 2021
Maintainer

zzzeek Apr 17, 2021
Maintainer